* [Bug target/94650] New: Missed x86-64 peephole optimization: x >= large power of two
@ 2020-04-18 18:28 pascal_cuoq at hotmail dot com
  2020-04-20  7:05 ` [Bug target/94650] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: pascal_cuoq at hotmail dot com @ 2020-04-18 18:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94650

            Bug ID: 94650
           Summary: Missed x86-64 peephole optimization: x >= large power
                    of two
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pascal_cuoq at hotmail dot com
  Target Milestone: ---

Consider the three functions check, test0 and test1:

(Compiler Explorer link: https://gcc.godbolt.org/z/Sh4GpR )

#include <string.h>

#define LARGE_POWER_OF_TWO (1UL << 40)

int check(unsigned long m)
{
    return m >= LARGE_POWER_OF_TWO;
}

void g(int);

void test0(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(0);
}

void test1(unsigned long m)
{
    if (m >= LARGE_POWER_OF_TWO) g(m);
}

At least in the case of check and test0, the optimal way to compare m to
1UL << 40 is to shift m right by 40 bits and compare the result to 0. This is
the code generated for these functions by Clang 10:

check:                                  # @check
        xorl    %eax, %eax
        shrq    $40, %rdi
        setne   %al
        retq
test0:                                  # @test0
        shrq    $40, %rdi
        je      .LBB1_1
        xorl    %edi, %edi
        jmp     g                       # TAILCALL
.LBB1_1:
        retq
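
At the source level, the rewrite corresponds to the identity that, for
unsigned m, the test m >= (1UL << 40) is equivalent to (m >> 40) != 0, since
the shift leaves exactly the bits of m at positions 40 and above. A minimal
sketch of check written in that form (check_shift is a name invented here for
illustration, not part of the testcase):

int check_shift(unsigned long m)
{
    /* Nonzero exactly when some bit of m at position 40 or above is set,
       i.e. exactly when m >= (1UL << 40). */
    return (m >> 40) != 0;
}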

In contrast, GCC 9.3 uses a 64-bit constant that must be loaded into a
register with movabsq:

check:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        seta    %al
        movzbl  %al, %eax
        ret
test0:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L5
        ret
.L5:
        xorl    %edi, %edi
        jmp     g


In the case of the function test1, the comparison is between these two
versions, because the shift is destructive and m must be preserved so that it
can be passed to g:

Clang 10:
test1:                                  # @test1
        movq    %rdi, %rax
        shrq    $40, %rax
        je      .LBB2_1
        jmp     g                       # TAILCALL
.LBB2_1:
        retq

GCC 9.3:
test1:
        movabsq $1099511627775, %rax
        cmpq    %rax, %rdi
        ja      .L8
        ret
.L8:
        jmp     g
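
For reference, a sketch of the shift-based test1 at the source level, with the
copy made explicit (test1_shift is an illustrative name, not part of the
testcase):

void test1_shift(unsigned long m)
{
    unsigned long top = m >> 40;   /* shift a copy; m itself is preserved */
    if (top != 0)
        g(m);                      /* original m is still available here */
}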

It is less obvious which approach is better in the case of the function test1,
but generally speaking the shift approach should still be faster. The
register-register move can be free on Skylake (in the sense of not needing any
execution port), whereas movabsq requires an execution port and is also a
10-byte instruction. (Code size favors the shift too: by my count, the
movq + shrq pair is 3 + 4 = 7 bytes, against 10 + 3 = 13 bytes for
movabsq + cmpq.)

