[Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM)

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM)
@ 2012-10-05 23:16 daniel.santos at pobox dot com
  2012-10-05 23:40 ` [Bug target/54829] " pinskia at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: daniel.santos at pobox dot com @ 2012-10-05 23:16 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

             Bug #: 54829
           Summary: bad optimization: sub followed by cmp w/ zero (x86 &
                    ARM)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: daniel.santos@pobox.com


I originally posted this under bug #3507 but have since discovered that it is
target-specific and is a separate issue than bug #3507.

extern print_gt(void);
extern print_lt(void);
extern print_eq(void);

void cmp_and_branch(long a, long b)
{
    long diff = a - b;

    if (diff > 0) {
        print_gt();
    } else if (diff < 0) {
        print_lt();
    } else {
        print_eq();
    }
}

Here, result of the subtraction is directly used in the branch code and nowhere
else.  However, gcc -O2 -S still generates this output:

cmp_and_branch:
.LFB0:
    .cfi_startproc
    subq    %rsi, %rdi
    cmpq    $0, %rdi
    jg    .L5
    jne    .L6
    jmp    print_eq
    .p2align 4,,10
    .p2align 3
.L5:
    jmp    print_gt
    .p2align 4,,10
    .p2align 3
.L6:
    jmp    print_lt
    .cfi_endproc

Notice that we're using subq followed by cmpq instead of just cmpq %rsi, %rdi. 
In another case, where there is a loop and one of the values compared against
remains the same, an additional mov instruction is required to prevent the
unchanging value's register from being destroyed, so it actually generates two
extra instructions in that situation.

When built on ARM, we get something similar:
cmp_and_branch:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    rsb    r1, r1, r0
    cmp    r1, #0
    bgt    .L5
    bne    .L6
    b    print_eq
.L5:
    b    print_gt
.L6:
    b    print_lt

Note here that we do rsb followed by cmp with zero again.  However, on PPC
(apinski from freenode compiled this for me), the result is actually correct:

subf. 9,4,3
bgt 0,.L5
bne 0,.L6
print_eq

Finally, on MIPS (also from apinski):
dsubu $4,$4,$5
bgtz $4,$L5
nop


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
@ 2012-10-05 23:40 ` pinskia at gcc dot gnu.org
  2012-10-06 13:31 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (ARM) hjl.tools at gmail dot com
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2012-10-05 23:40 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-10-05
     Ever Confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-10-05 23:39:56 UTC ---
Confirmed:
Trying 7 -> 8:
Failed to match this instruction:
(set (reg:CC 17 flags)
    (compare:CC (minus:DI (reg/v:DI 60 [ a ])
            (reg/v:DI 61 [ b ]))
        (const_int 0 [0])))


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
  2012-10-05 23:40 ` [Bug target/54829] " pinskia at gcc dot gnu.org
@ 2012-10-06 13:31 ` hjl.tools at gmail dot com
  2012-10-06 15:56 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: hjl.tools at gmail dot com @ 2012-10-06 13:31 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|bad optimization: sub       |bad optimization: sub
                   |followed by cmp w/ zero     |followed by cmp w/ zero
                   |(x86 & ARM)                 |(ARM)

--- Comment #2 from H.J. Lu <hjl.tools at gmail dot com> 2012-10-06 13:31:00 UTC ---
i386.md has

(define_insn "*cmp<mode>_minus_1"
  [(set (reg FLAGS_REG)
        (compare
          (minus:SWI (match_operand:SWI 0 "nonimmediate_operand" "<r>m,<r>")
                     (match_operand:SWI 1 "<general_operand>" "<r><i>,<r>m"))
          (const_int 0)))]
  "ix86_match_ccmode (insn, CCGOCmode)"
  "cmp{<imodesuffix>}\t{%1, %0|%0, %1}"
  [(set_attr "type" "icmp")
   (set_attr "mode" "<MODE>")])

Since cmp will also set OF and CF flags, it can't be used for GT and LE
which checks OF.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
  2012-10-05 23:40 ` [Bug target/54829] " pinskia at gcc dot gnu.org
  2012-10-06 13:31 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (ARM) hjl.tools at gmail dot com
@ 2012-10-06 15:56 ` daniel.santos at pobox dot com
  2012-10-06 15:57 ` daniel.santos at pobox dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: daniel.santos at pobox dot com @ 2012-10-06 15:56 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

Daniel Santos <daniel.santos at pobox dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|bad optimization: sub       |bad optimization: sub
                   |followed by cmp w/ zero     |followed by cmp w/ zero
                   |(ARM)                       |(x86 & ARM)

--- Comment #3 from Daniel Santos <daniel.santos at pobox dot com> 2012-10-06 15:55:47 UTC ---
Sure it can:

extern print_gt(void);
extern print_lt(void);
extern print_eq(void);

void cmp_and_branch(long a, long b)
{
    if (a > b) {
        print_gt();
    } else if (a < b) {
        print_lt();
    } else {
        print_eq();
    }
}

Here's x86_64:

cmp_and_branch:
.LFB0:
    .cfi_startproc
    cmpq    %rsi, %rdi
    jg  .L5
    jl  .L6
    jmp print_eq
    .p2align 4,,10
    .p2align 3
.L6:
    jmp print_lt
    .p2align 4,,10
    .p2align 3
.L5:
    jmp print_gt
    .cfi_endproc

And here's i386:

cmp_and_branch:
.LFB0:
    .cfi_startproc
    movl    8(%esp), %eax
    cmpl    %eax, 4(%esp)
    jg  .L5
    jl  .L6
    jmp print_eq
    .p2align 2,,3
.L6:
    jmp print_lt
    .p2align 2,,3
.L5:
    jmp print_gt
    .cfi_endproc


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (2 preceding siblings ...)
  2012-10-06 15:56 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
@ 2012-10-06 15:57 ` daniel.santos at pobox dot com
  2012-10-13 16:05 ` rearnsha at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: daniel.santos at pobox dot com @ 2012-10-06 15:57 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #4 from Daniel Santos <daniel.santos at pobox dot com> 2012-10-06 15:57:15 UTC ---
Please help me out here if I am missing something.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (3 preceding siblings ...)
  2012-10-06 15:57 ` daniel.santos at pobox dot com
@ 2012-10-13 16:05 ` rearnsha at gcc dot gnu.org
  2012-10-13 16:18 ` rearnsha at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2012-10-13 16:05 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #5 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2012-10-13 16:04:55 UTC ---
The result of the comparison is used in more than one instruction, so combine
cannot safely rework the branch instructions that follow to ensure that the
result of the subtract is used correctly.

In fact, on ARM there is no branch instruction that can be used for "> 0" as a
side effect of a subtract.  To get the desired effect the code would have to be
completely re-arranged to factor out the "< 0" (bmi) and then "== 0" (beq)
cases first.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (4 preceding siblings ...)
  2012-10-13 16:05 ` rearnsha at gcc dot gnu.org
@ 2012-10-13 16:18 ` rearnsha at gcc dot gnu.org
  2012-11-15 21:56 ` daniel.santos at pobox dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2012-10-13 16:18 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #6 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2012-10-13 16:18:03 UTC ---
Note also that flag setting behaviour of the PPC instruction essentially is a
comparison of the result against zero.  On ARM the flags are set as if the two
input operands were compared; that's not the same as comparing the result with
zero.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (5 preceding siblings ...)
  2012-10-13 16:18 ` rearnsha at gcc dot gnu.org
@ 2012-11-15 21:56 ` daniel.santos at pobox dot com
  2013-08-05 21:04 ` rearnsha at gcc dot gnu.org
  2015-02-15  6:17 ` daniel.santos at pobox dot com
  8 siblings, 0 replies; 10+ messages in thread
From: daniel.santos at pobox dot com @ 2012-11-15 21:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #7 from Daniel Santos <daniel.santos at pobox dot com> 2012-11-15 21:56:02 UTC ---
First off, I apologize for my late response here.

(In reply to comment #5)
I'm going to respond a little backwards..

> In fact, on ARM there is no branch instruction that can be used for "> 0" as a
> side effect of a subtract.  To get the desired effect the code would have to be
> completely re-arranged to factor out the "< 0" (bmi) and then "== 0" (beq)
> cases first.

I'm not an ARM programmer, but I'm looking at my reference book and it would
appear that BGT would perform a branch of greater than for signed comparison
and and BHI for unsigned comparison.  Again, convert the subtraction into a
comparison (subtract, but discard the result) and branch based upon the flags
(for signed numbers):

    cmp    r0, r1
    bgt    .L1
    bne    .L2
;handle equality here

Alternately, if the code to execute for each condition will not change the
result flags, then the use of condition instructions could replace the
branches.  In this example there is no need to check for equality because the
previous two branch instructions eliminate all other possibilities.  The reason
that my C code example above does *not* check for equality is that it is the
least likely condition and can be determined by eliminating the two most likely
conditions (negative or positive and not equal).

> The result of the comparison is used in more than one instruction, so combine
> cannot safely rework the branch instructions that follow to ensure that the
> result of the subtract is used correctly.

I suppose I'm going to have to learn the gcc internals.  It will probably be
good for me anyway.

However, If what you say is correct, then the *problem* lies at a higher level
than the "combine", but it does not invalidate the problem its self.  Where do
we make the decision that a result can be discarded?  That would seem to be
where the cause of this problem lies.

So to break it down more accurately, if all of these conditions are true:

1.) we perform an integral subtraction,
2.) and every use of the result can be replaced with fewer instructions that
use the CPU flags that were changed by the subtraction instead of the result
its self,
3.) and no instructions between the subtraction and the last use of its result
modify those flags

then we should replace the subtract operation with a compare and use the CPU
flags instead.  I am not aware of any case, when both a and b are of the same 
signed integral type, where "(a - b) > 0" and "(a - b) < 0" cannot be replaced
with "a > b" and "a < b", respectively.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (6 preceding siblings ...)
  2012-11-15 21:56 ` daniel.santos at pobox dot com
@ 2013-08-05 21:04 ` rearnsha at gcc dot gnu.org
  2015-02-15  6:17 ` daniel.santos at pobox dot com
  8 siblings, 0 replies; 10+ messages in thread
From: rearnsha at gcc dot gnu.org @ 2013-08-05 21:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #8 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
(In reply to Daniel Santos from comment #7)
> First off, I apologize for my late response here.
> 
> (In reply to comment #5)
> I'm going to respond a little backwards..
> 
> > In fact, on ARM there is no branch instruction that can be used for "> 0" as a
> > side effect of a subtract.  To get the desired effect the code would have to be
> > completely re-arranged to factor out the "< 0" (bmi) and then "== 0" (beq)
> > cases first.
> 
> I'm not an ARM programmer, but I'm looking at my reference book and it would
> appear that BGT would perform a branch of greater than for signed comparison
> and and BHI for unsigned comparison.  Again, convert the subtraction into a
> comparison (subtract, but discard the result) and branch based upon the
> flags (for signed numbers):
> 
>     cmp    r0, r1
>     bgt    .L1
>     bne    .L2
> ;handle equality here
> 

Unfortunately, computers don't to infinite precision arithmetic by default. 
That would perform a different comparison in that it checks that r0 > r1, not
whether r0 - r1 > 0.  The difference, for signed comparisons, is when overflow
occurs.

Consider the case where (in your original code) a has the value INT_MIN (ie
-2147483648) and b has the value 1.

Now clearly a < b and by the normal rules of arithmetic (infinite precision) we
would expect a - b to be less than zero.

However, INT_MIN - 1 cannot be represented in a 32-bit long value and becomes
INT_MAX due to overflow; the result is that for these values a - b > 0!

On ARM and x86, the flag setting that results from a subtract operation is, in
effect a comparison of the original operands, rather than a comparison of the
result; that is on ARM

   subs rd, rn, rm

is equivalent to 

   cmp rn, rm

except that the register rd is not written by the comparison.

Power PC is different: it's subtract and compare instruction really does use
the result of the subtraction to form the comparison.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM)
  2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
                   ` (7 preceding siblings ...)
  2013-08-05 21:04 ` rearnsha at gcc dot gnu.org
@ 2015-02-15  6:17 ` daniel.santos at pobox dot com
  8 siblings, 0 replies; 10+ messages in thread
From: daniel.santos at pobox dot com @ 2015-02-15  6:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54829

--- Comment #9 from Daniel Santos <daniel.santos at pobox dot com> ---
I appologize for my late response.

(In reply to Richard Earnshaw from comment #8)
> Unfortunately, computers don't to infinite precision arithmetic by default. 
> That would perform a different comparison in that it checks that r0 > r1,
> not whether r0 - r1 > 0.  The difference, for signed comparisons, is when
> overflow occurs.
> 
> Consider the case where (in your original code) a has the value INT_MIN (ie
> -2147483648) and b has the value 1.
> 
> Now clearly a < b and by the normal rules of arithmetic (infinite precision)
> we would expect a - b to be less than zero.
> 
> However, INT_MIN - 1 cannot be represented in a 32-bit long value and
> becomes INT_MAX due to overflow; the result is that for these values a - b >
> 0!
> 
> On ARM and x86, the flag setting that results from a subtract operation is,
> in effect a comparison of the original operands, rather than a comparison of
> the result; that is on ARM
> 
>    subs rd, rn, rm
> 
> is equivalent to 
> 
>    cmp rn, rm
> 
> except that the register rd is not written by the comparison.
> 
> Power PC is different: it's subtract and compare instruction really does use
> the result of the subtraction to form the comparison.

Thank you very much for your work on this. In re-examining, I'm suspecting that
this may be an invalid bug. :( I have modified the test program slightly:

extern print_gt(void);
extern print_lt(void);
extern print_eq(void);

void cmp_and_branch(long a, long b)
{
    long diff = a > b ? 1 : (a < b ? -1 : 0);

    if (diff > 0) {
        print_gt();
    } else if (diff < 0) {
        print_lt();
    } else {
        print_eq();
    }
}

I thought that I had originally tried this and gotten worse results (although
the diff was being done via a complicated -findirect-inline situation), but
this version of the program leaves a finite number of options. When compiled on
x86_64 and ARM, both are flawless:

ARM
cmp_and_branch:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        cmp     r0, r1
        bgt     .L2
        blt     .L5
        b       print_eq
.L2:
        b       print_gt
.L5:
        b       print_lt
        .size   cmp_and_branch, .-cmp_and_branch
        .ident  "GCC: (Gentoo 4.8.3 p1.1, pie-0.5.9) 4.8.3"
        .section        .note.GNU-stack,"",%progbits


x86_64
cmp_and_branch:
.LFB0:
        .cfi_startproc
        cmpq    %rsi, %rdi
        jg      .L2
        jl      .L5
        jmp     print_eq
        .p2align 4,,10
        .p2align 3
.L2:
        jmp     print_gt
        .p2align 4,,10
        .p2align 3
.L5:
        jmp     print_lt
        .cfi_endproc

I don't want to close this bug just yet, I want to reset in my other code. This
will certainly help clean up some of my code!!


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-02-15  6:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-05 23:16 [Bug target/54829] New: bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
2012-10-05 23:40 ` [Bug target/54829] " pinskia at gcc dot gnu.org
2012-10-06 13:31 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (ARM) hjl.tools at gmail dot com
2012-10-06 15:56 ` [Bug target/54829] bad optimization: sub followed by cmp w/ zero (x86 & ARM) daniel.santos at pobox dot com
2012-10-06 15:57 ` daniel.santos at pobox dot com
2012-10-13 16:05 ` rearnsha at gcc dot gnu.org
2012-10-13 16:18 ` rearnsha at gcc dot gnu.org
2012-11-15 21:56 ` daniel.santos at pobox dot com
2013-08-05 21:04 ` rearnsha at gcc dot gnu.org
2015-02-15  6:17 ` daniel.santos at pobox dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).