[Bug tree-optimization/96633] New: missed optimization?

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/96633] New: missed optimization?
@ 2020-08-16 21:25 nathan at gcc dot gnu.org
  2020-08-17  8:14 ` [Bug tree-optimization/96633] " marxin at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: nathan at gcc dot gnu.org @ 2020-08-16 21:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

            Bug ID: 96633
           Summary: missed optimization?
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathan at gcc dot gnu.org
  Target Milestone: ---

Matt Godbolt's https://queue.acm.org/detail.cfm?id=3372264
has an example of optimizing on amd64:

bool isWhitespace(char c)
 {
     return c == ' '
       || c == '\r'
       || c == '\n'
       || c == '\t';
 }

GCC generates: 
        xorl    %eax, %eax
        cmpb    $32, %dil
        ja      .L1
        movabsq $4294977024, %rax
        movl    %edi, %ecx
        shrq    %cl, %rax
        andl    $1, %eax
.L1:
        ret

following an amazing comment on the ML, I've determined the following is abot
12% faster (and shorter too):

        movabsq $4294977024, %rax
        movl   %edi, %ecx
        shrq    %cl, %rax
        shr    $6, %ecx
        andl    $1, %eax
        shrq    %cl, %rax
        ret

We're dealing with chars in the range [-128,128), and x86's shift operator only
considers the bottom 6 bits.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/96633] missed optimization?
  2020-08-16 21:25 [Bug tree-optimization/96633] New: missed optimization? nathan at gcc dot gnu.org
@ 2020-08-17  8:14 ` marxin at gcc dot gnu.org
  2020-08-17 12:07 ` amonakov at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-08-17  8:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-08-17
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
                 CC|                            |marxin at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Martin Liška <marxin at gcc dot gnu.org> ---
Nice trick!
Generally speaking about the optimization (for longer types), it can be tricky
as:

        shr    $6, %ecx
can be bigger than 64 and so
        shrq    %cl, %rax

can result in undefined behavior as 1 << X, where X > 64.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/96633] missed optimization?
  2020-08-16 21:25 [Bug tree-optimization/96633] New: missed optimization? nathan at gcc dot gnu.org
  2020-08-17  8:14 ` [Bug tree-optimization/96633] " marxin at gcc dot gnu.org
@ 2020-08-17 12:07 ` amonakov at gcc dot gnu.org
  2020-08-20 11:36 ` marxin at gcc dot gnu.org
  2021-07-21  1:05 ` [Bug middle-end/96633] " pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: amonakov at gcc dot gnu.org @ 2020-08-17 12:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Martin added me to CC so I assume he wants me to chime in.

First of all, I find Nathan's behavior in that gcc@ thread distasteful at best
(but if you ask me, such responses are simply more harm than good; link:
https://lwn.net/ml/gcc/1a363f89-6f98-f583-e22a-a7fc02efb4db@acm.org/ ).

Next, statements like "I've determined the following is abot 12% faster" don't
carry weight without details such as the CPU family, structure of the benchmark
and the workload. Obviously, on input that lacks whitespace GCC's original code
is faster as the initial branch is 100% predictable. Likewise, if the input was
taken from /dev/random, the 12% figure is irrelevant to real-world uses of such
code. What the benchmark is doing with the return value of the function also
matters a lot.

With that out of the way: striving to get efficient branchless code on this
code is not very valuable in practice, because the caller is likely to perform
a conditional branch on the result anyway. So making isWhitespace branchless
simply moves the misprediction cost to the caller, making the overall code
slower.

(but of course such considerations are too complex for the compiler's limited
brain)

In general such "bitmask tests" will benefit from the BT instruction on x86
(not an extension, was in the ISA since before I was born), plus CMOV to get
the right mask if it doesn't fit in a register.

For 100% branchless code we want to generate code similar to:

char is_ws(char c)
{
        unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
        unsigned long long v = c;
        if (v > 32)
#if 1
            mask = 0;
#else
            return 0;
#endif
        char r;
        asm("bt %1, %2; setc %0" : "=r"(r) : "r"(v), "r"(mask));
        return r;
}

        movsbq  %dil, %rax
        movl    $0, %edx
        movabsq $4294977024, %rdi
        cmpq    $33, %rax
        cmovnb  %rdx, %rdi
        bt %rax, %rdi; setc %al
        ret

(note we get %edx zeroing suboptimal, should have used xor %edx, %edx)

This is generalizable to any input type, not just char.

We even already get the "test against a mask" part of the idea right ;)

Branchy testing is even cheaper with BT:

void is_ws_cb(unsigned char c, void f(void))
{
        unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
        if (c <= 32 && (mask & (1ll<<c)))
            f();
}

        cmpb    $32, %dil
        ja      .L5
        movabsq $4294977024, %rax
        btq     %rdi, %rax
        jnc     .L5
        jmp     *%rsi
.L5:
        ret

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/96633] missed optimization?
  2020-08-16 21:25 [Bug tree-optimization/96633] New: missed optimization? nathan at gcc dot gnu.org
  2020-08-17  8:14 ` [Bug tree-optimization/96633] " marxin at gcc dot gnu.org
  2020-08-17 12:07 ` amonakov at gcc dot gnu.org
@ 2020-08-20 11:36 ` marxin at gcc dot gnu.org
  2021-07-21  1:05 ` [Bug middle-end/96633] " pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: marxin at gcc dot gnu.org @ 2020-08-20 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #2)
> Martin added me to CC so I assume he wants me to chime in.

Yes, you understood very well and thanks for the reply.

... 

> With that out of the way: striving to get efficient branchless code on this
> code is not very valuable in practice, because the caller is likely to
> perform a conditional branch on the result anyway. So making isWhitespace
> branchless simply moves the misprediction cost to the caller, making the
> overall code slower.

That's very important observation!

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug middle-end/96633] missed optimization?
  2020-08-16 21:25 [Bug tree-optimization/96633] New: missed optimization? nathan at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-08-20 11:36 ` marxin at gcc dot gnu.org
@ 2021-07-21  1:05 ` pinskia at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-21  1:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=46235

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On the trunk we get:
        cmpb    $32, %dil
        ja      .L3
        movabsq $4294977024, %rax
        btq     %rdi, %rax
        setc    %al
        ret
.L3:
        xorl    %eax, %eax
        ret


Which is close.

The reason why we produce a bt now is because of PR46235.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-07-21  1:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-16 21:25 [Bug tree-optimization/96633] New: missed optimization? nathan at gcc dot gnu.org
2020-08-17  8:14 ` [Bug tree-optimization/96633] " marxin at gcc dot gnu.org
2020-08-17 12:07 ` amonakov at gcc dot gnu.org
2020-08-20 11:36 ` marxin at gcc dot gnu.org
2021-07-21  1:05 ` [Bug middle-end/96633] " pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).