public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly
@ 2020-06-20 10:13 gabravier at gmail dot com
  2020-06-20 12:09 ` [Bug target/95784] " pinskia at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 10:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

            Bug ID: 95784
           Summary: Failure to optimize usage of __builtin_add_overflow
                    with return statement properly
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

#include <stdint.h>

int f(uint8_t operand, int8_t *result)
{
    if (__builtin_add_overflow(operand, 0, result))
    {
        *result = 0;
        return 10213;
    }
    return 0;
}
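
For reference, since the addend is 0, the builtin reports overflow exactly when
operand does not fit in int8_t, i.e. when operand is above 127. A minimal sketch
that makes this explicit and checks it over all 256 inputs; f_reference and
count_mismatches are hypothetical names used only for illustration, not part of
the report:

```c
#include <stdint.h>

/* The function from the report, unchanged. */
int f(uint8_t operand, int8_t *result)
{
    if (__builtin_add_overflow(operand, 0, result))
    {
        *result = 0;
        return 10213;
    }
    return 0;
}

/* Hypothetical hand-written equivalent: operand + 0 overflows int8_t
   exactly when operand is above INT8_MAX (127). */
int f_reference(uint8_t operand, int8_t *result)
{
    if (operand > INT8_MAX)
    {
        *result = 0;
        return 10213;
    }
    *result = (int8_t)operand;
    return 0;
}

/* Runs both versions on every uint8_t value and returns the number of
   inputs on which the return value or the stored byte differ. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int i = 0; i <= UINT8_MAX; i++)
    {
        int8_t a = -1, b = -1;
        int ra = f((uint8_t)i, &a);
        int rb = f_reference((uint8_t)i, &b);
        if (ra != rb || a != b)
            mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main is enough to confirm that the two
versions agree, assuming a compiler that provides __builtin_add_overflow (GCC
or Clang).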

With -O3, LLVM outputs this:

f(unsigned char, signed char*): # @f(unsigned char, signed char*)
  movsx eax, dil
  xor ecx, ecx
  cmp edi, eax
  cmovne edi, ecx
  mov byte ptr [rsi], dil
  mov eax, 10213
  cmove eax, ecx
  ret

GCC outputs this:

f(unsigned char, signed char*):
  movzx eax, dil
  movsx di, dil
  cmp ax, di
  setne dl
  mov r8d, edx
  sal r8d, 31
  sar r8d, 31
  and r8d, 10213
  test dl, dl
  mov edx, 0
  cmovne eax, edx
  mov BYTE PTR [rsi], al
  mov eax, r8d
  ret

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
@ 2020-06-20 12:09 ` pinskia at gcc dot gnu.org
  2020-06-20 14:01 ` gabravier at gmail dot com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-06-20 12:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I don't see an issue here. GCC is avoiding one cmov by doing a sign extend from
bit 0 (sal/sar) and an and instruction.

Cmov on x86 processors is a bit weird and sometimes slower than using
arithmetic instructions.
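
The sal/sar/and idiom mentioned above can be sketched in C roughly as follows.
This is an illustration only, not what GCC does internally: select_by_mask is a
hypothetical name, and the mask is built by unsigned negation rather than by
shifts, which gives the same all-ones or all-zeros value without relying on
right shifts of negative numbers.

```c
#include <stdint.h>

/* Returns 10213 when bit 0 of flag is set and 0 otherwise, using only
   arithmetic: no branch and no conditional select in the source. */
int32_t select_by_mask(uint32_t flag)
{
    /* 0 - 1 wraps to 0xFFFFFFFF and 0 - 0 stays 0. This is the same mask
       that "sal r8d, 31" followed by "sar r8d, 31" builds from bit 0. */
    uint32_t mask = 0u - (flag & 1u);

    /* Corresponds to "and r8d, 10213". */
    return (int32_t)(mask & 10213u);
}
```

Each step consumes the result of the previous one, so the three operations form
a serial chain.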

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
  2020-06-20 12:09 ` [Bug target/95784] " pinskia at gcc dot gnu.org
@ 2020-06-20 14:01 ` gabravier at gmail dot com
  2020-06-20 14:02 ` gabravier at gmail dot com
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #2 from Gabriel Ravier <gabravier at gmail dot com> ---
Is cmov so slow that avoiding it is worth the following extra instructions?
- 1 movzx
- 1 setcc
- 1 sal
- 1 sar
- 1 and
- 1 test

From what I can see, the dependencies of the first cmov in the LLVM version
are:
- The left operand, which the preceding cmp already depends on
- EFLAGS from the cmp
- The 0 from the xor

And the dependencies for the second cmov are:
- The left operand from the preceding mov
- The right operand, which the preceding cmov already depends on
- EFLAGS, which the preceding cmov already depends on

And I can't see how the dependencies for the second cmov can slow it down so
much that doing all the extra computing is worth it.
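
To make the dependency argument concrete, here is a rough C rendering of the
LLVM sequence: one comparison feeding two selects. f_two_selects is a
hypothetical name, and a compiler is free to lower this differently, so treat
it as a reading aid rather than a way to force that exact output.

```c
#include <stdint.h>

/* Hypothetical C rendering of the LLVM assembly: both selects read the
   same comparison, and neither reads the other's output. */
int f_two_selects(uint8_t operand, int8_t *result)
{
    /* (int8_t)operand is implementation-defined for values above 127;
       GCC and Clang wrap modulo 256, which is assumed here. */
    int32_t widened = operand;               /* zero-extended input (edi) */
    int32_t sign_extended = (int8_t)operand; /* movsx eax, dil            */
    int fits = (widened == sign_extended);   /* cmp edi, eax              */

    *result = fits ? (int8_t)operand : 0;    /* cmovne edi, ecx + store   */
    return fits ? 0 : 10213;                 /* mov eax, 10213; cmove     */
}
```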

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
  2020-06-20 12:09 ` [Bug target/95784] " pinskia at gcc dot gnu.org
  2020-06-20 14:01 ` gabravier at gmail dot com
@ 2020-06-20 14:02 ` gabravier at gmail dot com
  2020-06-20 14:25 ` gabravier at gmail dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 14:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #3 from Gabriel Ravier <gabravier at gmail dot com> ---
The most important problem here is really that all the computations for r8d
depend on each other, too: each step of the sal+sar+and chain depends on the
previous operation. The LLVM version seems much better suited to out-of-order
execution than the GCC version.

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (2 preceding siblings ...)
  2020-06-20 14:02 ` gabravier at gmail dot com
@ 2020-06-20 14:25 ` gabravier at gmail dot com
  2020-06-20 14:27 ` gabravier at gmail dot com
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 14:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #4 from Gabriel Ravier <gabravier at gmail dot com> ---
I've tried to benchmark this on my "Intel(R) Core(TM) i5-4310U CPU @ 2.00GHz"
(as reported by /proc/cpuinfo) and I'm getting some really weird results, so
help with benchmarking (which should determine which version really is faster)
would be welcome. I used the attached test.S source file and got these
results:

$ gcc test.S -O3 -ggdb3 -DGCC_VERSION && time ./a.out && \
  gcc test.S -O3 -ggdb3 -DLLVM_VERSION && time ./a.out && \
  gcc test.S -O3 -ggdb3 && time ./a.out

real    0m2.210s # GCC version
user    0m2.200s
sys     0m0.001s

real    0m1.825s # LLVM version is faster as I expected
user    0m1.817s
sys     0m0.001s

real    0m2.080s # Doing nothing is somehow slower than the LLVM version ?
user    0m2.071s
sys     0m0.001s

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (3 preceding siblings ...)
  2020-06-20 14:25 ` gabravier at gmail dot com
@ 2020-06-20 14:27 ` gabravier at gmail dot com
  2020-06-20 14:33 ` gabravier at gmail dot com
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 14:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #5 from Gabriel Ravier <gabravier at gmail dot com> ---
Created attachment 48760
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48760&action=edit
File for benchmarking this function

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (4 preceding siblings ...)
  2020-06-20 14:27 ` gabravier at gmail dot com
@ 2020-06-20 14:33 ` gabravier at gmail dot com
  2020-06-20 15:51 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 14:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #6 from Gabriel Ravier <gabravier at gmail dot com> ---
Created attachment 48761
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48761&action=edit
File for benchmarking this function but everything is aligned properly.

I've changed the source file slightly; it looks like the LLVM version was
faster than the "do nothing" version because the loop was misaligned. These are
the results I get with the loops aligned (I've also adjusted the number of
iterations):

$ gcc test.S -O3 -ggdb3 -DGCC_VERSION && time ./a.out && \
  gcc test.S -O3 -ggdb3 -DLLVM_VERSION && time ./a.out && \
  gcc test.S -O3 -ggdb3 && time ./a.out

real    0m3.130s # GCC version
user    0m3.122s
sys     0m0.001s

real    0m2.599s # LLVM version
user    0m2.593s
sys     0m0.001s

real    0m2.597s # version that does nothing
user    0m2.591s
sys     0m0.000s

The LLVM version is now almost as fast as literally doing nothing, so it looks
much better than the GCC version, at least to me.

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (5 preceding siblings ...)
  2020-06-20 14:33 ` gabravier at gmail dot com
@ 2020-06-20 15:51 ` jakub at gcc dot gnu.org
  2020-06-20 19:05 ` gabravier at gmail dot com
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-06-20 15:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This is not a good benchmark, you aren't actually using the result of the cmov
(e.g. in the caller) in any way, so no wonder you don't care about the terrible
latency it has.
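
One possible way to put the result on the critical path, sketched in C under
the assumption that the function is called out of line: feed both outputs of
each call into the input of the next call, so that every iteration has to wait
for the previous one. run_dependent_chain and its mixing formula are
hypothetical and are not taken from the attached test.S.

```c
#include <stdint.h>

/* Function under test, from the report. noinline (a GCC/Clang attribute)
   keeps the compiler from folding the call into the loop below. */
__attribute__((noinline))
int f(uint8_t operand, int8_t *result)
{
    if (__builtin_add_overflow(operand, 0, result))
    {
        *result = 0;
        return 10213;
    }
    return 0;
}

/* Calls f repeatedly; the next operand is derived from the previous
   return value and the previously stored byte, so the latency of
   producing them is part of the loop-carried dependency. */
uint32_t run_dependent_chain(uint32_t iterations)
{
    uint32_t state = 1;
    for (uint32_t i = 0; i < iterations; i++)
    {
        int8_t stored;
        int ret = f((uint8_t)state, &stored);
        state = state * 5u + (uint32_t)ret + (uint8_t)stored + 1u;
    }
    return state;
}
```

A driver would call run_dependent_chain with a large iteration count, print
the returned value so the loop cannot be discarded, and be timed with
"time ./a.out" as in the earlier comments.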

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (6 preceding siblings ...)
  2020-06-20 15:51 ` jakub at gcc dot gnu.org
@ 2020-06-20 19:05 ` gabravier at gmail dot com
  2020-06-22  8:05 ` rguenth at gcc dot gnu.org
  2021-08-03  3:44 ` [Bug rtl-optimization/95784] " pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-06-20 19:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

Gabriel Ravier <gabravier at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #48760|0                           |1
        is obsolete|                            |
  Attachment #48761|0                           |1
        is obsolete|                            |

--- Comment #8 from Gabriel Ravier <gabravier at gmail dot com> ---
Created attachment 48763
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48763&action=edit
File for benchmarking this function but everything is aligned properly and the
result is used

I've added another benchmark that uses the result of the operation (stores it
on the stack). Here is the average time taken to execute the program over 10
runs of each version:

GCC's version : 2.418 seconds
LLVM's version : 2.160 seconds
The "do nothing" version : 2.107 seconds

Do tell me if the benchmark is still flawed.

* [Bug target/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (7 preceding siblings ...)
  2020-06-20 19:05 ` gabravier at gmail dot com
@ 2020-06-22  8:05 ` rguenth at gcc dot gnu.org
  2021-08-03  3:44 ` [Bug rtl-optimization/95784] " pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-06-22  8:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
I wouldn't be surprised if a version with a branch is faster even with each
of the branches mispredicted.  cmovs are weird beasts, but since they
are not dependent on each other their latency at least shouldn't add up here,
so LLVM's two cmovs shouldn't be worse off than GCC's one cmov.  You'd need to
compare against a variant without any cmov.

* [Bug rtl-optimization/95784] Failure to optimize usage of __builtin_add_overflow with return statement properly
  2020-06-20 10:13 [Bug tree-optimization/95784] New: Failure to optimize usage of __builtin_add_overflow with return statement properly gabravier at gmail dot com
                   ` (8 preceding siblings ...)
  2020-06-22  8:05 ` rguenth at gcc dot gnu.org
@ 2021-08-03  3:44 ` pinskia at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-03  3:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-08-03
             Status|UNCONFIRMED                 |NEW
          Component|target                      |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #10 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even aarch64 has:

        cmp     w3, w4, uxth
        cset    w3, ne
        cmp     w3, 0
        csel    w2, w2, wzr, eq
        csel    w0, w0, wzr, ne
