public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Peephole optimisation: isWhitespace()
@ 2020-08-14 16:43 Stefan Kanthak
  2020-08-15 21:41 ` Nathan Sidwell
  2020-08-17 14:31 ` Allan Sandfeld Jensen
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-14 16:43 UTC (permalink / raw)
  To: gcc

Hi @ll,

in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
Matt Godbolt used the function

| bool isWhitespace(char c)
| {
|     return c == ' '
|       || c == '\r'
|       || c == '\n'
|       || c == '\t';
| }

as an example, for which GCC 9.1 emits the following assembly for AMD64
processors (see <https://godbolt.org/z/acm19_conds>):

|    xor    eax, eax              ; result = false
|    cmp    dil, 32               ; is c > 32
|    ja     .L4                   ; if so, exit with false
|    movabs rax, 4294977024       ; rax = 0x100002600
|    shrx   rax, rax, rdi         ; rax >>= c
|    and    eax, 1                ; result = rax & 1
|.L4:
|    ret

This code is but not optimal!
The following equivalent and branchless code works on i386 too,
it needs neither an AMD64 processor nor the SHRX instruction,
which is not available on older processors:

     mov    ecx, edi
     mov    eax, 2600h            ; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
     test   cl, cl
     setnz  al                    ; eax |= (c != '\0')
     shr    eax, cl               ; eax >>= (c % ' ')
     xor    edx, edx
     cmp    ecx, 33               ; CF = c <= ' '
     adc    edx, edx              ; edx = (c <= ' ')
     and    eax, edx
     ret


regards
Stefan Kanthak

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-14 16:43 Peephole optimisation: isWhitespace() Stefan Kanthak
@ 2020-08-15 21:41 ` Nathan Sidwell
  2020-08-16 13:54   ` Stefan Kanthak
  2020-08-17 14:31 ` Allan Sandfeld Jensen
  1 sibling, 1 reply; 13+ messages in thread
From: Nathan Sidwell @ 2020-08-15 21:41 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On 8/14/20 12:43 PM, Stefan Kanthak wrote:
> Hi @ll,
> 
> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
> Matt Godbolt used the function
> 
> | bool isWhitespace(char c)
> | {
> |     return c == ' '
> |       || c == '\r'
> |       || c == '\n'
> |       || c == '\t';
> | }
> 
> as an example, for which GCC 9.1 emits the following assembly for AMD64
> processors (see <https://godbolt.org/z/acm19_conds>):
> 
> |    xor    eax, eax              ; result = false
> |    cmp    dil, 32               ; is c > 32
> |    ja     .L4                   ; if so, exit with false
> |    movabs rax, 4294977024       ; rax = 0x100002600
> |    shrx   rax, rax, rdi         ; rax >>= c
> |    and    eax, 1                ; result = rax & 1
> |.L4:
> |    ret
> 
> This code is but not optimal!

What evidence do you have that your alternative sequence performs better?  Have 
you benchmarked it?  (I tried, but your code doesn't assemble)

It is more instructions and cannot speculate past the setnz (As I understand it, 
x86_64 speculates branch instructions, but doesn't speculate cmov -- so 
perversely branches are faster!)

> The following equivalent and branchless code works on i386 too,
> it needs neither an AMD64 processor nor the SHRX instruction,
> which is not available on older processors:

> 
>       mov    ecx, edi
>       mov    eax, 2600h            ; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
>       test   cl, cl
>       setnz  al                    ; eax |= (c != '\0')
>       shr    eax, cl               ; eax >>= (c % ' ')

^^ operand type mismatch on this instruction

>       xor    edx, edx
>       cmp    ecx, 33               ; CF = c <= ' '
>       adc    edx, edx              ; edx = (c <= ' ')
>       and    eax, edx
>       ret
> 
> 
> regards
> Stefan Kanthak
> 


-- 
Nathan Sidwell

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-15 21:41 ` Nathan Sidwell
@ 2020-08-16 13:54   ` Stefan Kanthak
  2020-08-16 20:18     ` Nathan Sidwell
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-16 13:54 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc

"Nathan Sidwell" <nathan@acm.org> wrote:

> On 8/14/20 12:43 PM, Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> |     return c == ' '
>> |       || c == '\r'
>> |       || c == '\n'
>> |       || c == '\t';
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> processors (see <https://godbolt.org/z/acm19_conds>):
>> 
>> |    xor    eax, eax              ; result = false
>> |    cmp    dil, 32               ; is c > 32
>> |    ja     .L4                   ; if so, exit with false
>> |    movabs rax, 4294977024       ; rax = 0x100002600
>> |    shrx   rax, rax, rdi         ; rax >>= c
>> |    and    eax, 1                ; result = rax & 1
>> |.L4:
>> |    ret
>> 
>> This code is but not optimal!
> 
> What evidence do you have that your alternative sequence performs
> better?

45+ years experience in writing assembly code!

> Have you benchmarked it?

Of course! Did you?
I didn't include the numbers in my initial post since I don't have
a processor which supports BMI2 and thus can't run the original code.
I benchmarked the following equivalent code (input character is in
ECX instead of EDI):

GCC:                      x64:                       x86:
    xor    eax, eax           movabs rax, 100002600h     mov   eax, 2600h
    cmp    cl, 32                                        test  cl, cl
    ja     L4                                            setnz al
    movabs rax, 100002600h    shr    rax, cl             shr   eax, cl
    shr    rax, cl            xor    edx, edx            xor   edx, edx
    and    eax, 1             cmp    ecx, 33             cmp   edx, 33
                              setb   dl                  setb  dl
L4:                           and    eax, edx            and   eax, edx
    ret                       ret                        ret

Left column 1 billion sequential characters
    for (int i=1000000000; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:   2.4    3.4
x64:   3.0    2.5
x86:   4.3    4.5

> (I tried, but your code doesn't assemble)

Wrong: YOUR code doesn't assemble, mine does! And it works well too.

> It is more instructions

Because I dared to show code for the old(er) i386 alias x86 processor,
not for the AMD64 alias x86_64.
I expect people commenting on assembly to understand (just a little bit)
of it and to get the idea of the code I posted first, then eventually be
able to come up with the following x86_64 code (without a fancy SHRX,
i.e. for ALL AMD64 processors, not just those supporting BMI2):

     mov    ecx, edi
     movabs rax, 100002600h       ; rax = (1 << ' ') | (1 << '\r') | (1 << '\n') | (1 << '\t')
     shr    rax, cl               ; rax >>= c
     xor    edi, edi
     cmp    ecx, 33               ; CF = c <= ' '
.if 0
     adc    edi, edi              ; edi = (c <= ' ')
.else
     setb   dil
.endif
     and    eax, edi              ; result = rax & (c <= ' ')
     ret


If someone (or GCC) but really insists to (ab)use SHRX:

     xor    ecx, ecx
     cmp    dil, 33               ; is !(c > 32)
     setb   cl
     movabs rax, 4294977024       ; rax = 0x100002600
     shrx   rax, rax, rdi         ; rax >>= c
     and    eax, ecx              ; result = rax & (c <= ' ')
     ret

Now start counting: first the instructions, then the cycles lost
due to wrong branch prediction (hint: 0 for my code).

> and cannot speculate past the setnz

OUCH!
Unfortunately, AMD64 and i386 processors are not THAT retarded.

> (As I understand it, x86_64 speculates branch instructions, but
> doesn't speculate cmov -- so perversely branches are faster!)

There's no CMOV in my code.
What are you trying to demonstrate?

JFTR: how much speculative executed instructions are discarded if
      the branch prediction is wrong?

>> The following equivalent and branchless code works on i386 too,
>> it needs neither an AMD64 processor nor the SHRX instruction,
>> which is not available on older processors:
> 
>> 
>>       mov    ecx, edi
>>       mov    eax, 2600h            ; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
>>       test   cl, cl
>>       setnz  al                    ; eax |= (c != '\0')
>>       shr    eax, cl               ; eax >>= (c % ' ')
> 
> ^^ operand type mismatch on this instruction

I recommend to fix YOUR typo!
I bet you wrote "shr eax, c" instead of "shr eax, cl" there.

>>       xor    edx, edx
>>       cmp    ecx, 33               ; CF = c <= ' '
>>       adc    edx, edx              ; edx = (c <= ' ')
>>       and    eax, edx
>>       ret

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-16 13:54   ` Stefan Kanthak
@ 2020-08-16 20:18     ` Nathan Sidwell
  2020-08-17 13:28       ` Stefan Kanthak
  0 siblings, 1 reply; 13+ messages in thread
From: Nathan Sidwell @ 2020-08-16 20:18 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On 8/16/20 9:54 AM, Stefan Kanthak wrote:
> "Nathan Sidwell" <nathan@acm.org> wrote:

>> What evidence do you have that your alternative sequence performs
>> better?
> 
> 45+ years experience in writing assembly code!


> 
>> Have you benchmarked it?
> 
> Of course! Did you?
> I didn't include the numbers in my initial post since I don't have
> a processor which supports BMI2 and thus can't run the original code.
> I benchmarked the following equivalent code (input character is in
> ECX instead of EDI):

you seem very angry about being asked for data.  As I said, I couldn't benchmark 
your code, because of the incorrect assembly.

As some one with 45+years of writing assembly, you'll be aware that processor 
micro architectures have changed dramatically over that time, and one can very 
easily be misled by 'intuition'.
> 
> Because I dared to show code for the old(er) i386 alias x86 processor,
> not for the AMD64 alias x86_64.

Which I did find bizarre -- if you're targeting an x86_64 ISA, why are you 
writing code for a different processor?

anyway, you've made it clear you do not wish to engage in constructive discussion.

BTW, I have come up with a sequence as short as GCC's but without the 
conditional branch.  Sadly the margin is too small to write it.

Good day, sir

nathan

-- 
Nathan Sidwell

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-16 20:18     ` Nathan Sidwell
@ 2020-08-17 13:28       ` Stefan Kanthak
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-17 13:28 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: gcc

"Nathan Sidwell" <nathan@acm.org>

> On 8/16/20 9:54 AM, Stefan Kanthak wrote:
>> "Nathan Sidwell" <nathan@acm.org> wrote:

[...]

>>> Have you benchmarked it?
>> 
>> Of course! Did you?

[...]

> you seem very angry about being asked for data.

As much as you hallucinated about CMOV or processors not speculating beyond
SETNZ?

>  As I said, I couldn't benchmark your code, because of the incorrect assembly.

As I wrote, that's your fault: every x86 and its evolution x64 know SHR EAX, CL

[...]

> anyway, you've made it clear you do not wish to engage in constructive discussion.

Use a mirror!

> BTW, I have come up with a sequence as short as GCC's but without the 
> conditional branch.  Sadly the margin is too small to write it.

Reality surely bites!

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-14 16:43 Peephole optimisation: isWhitespace() Stefan Kanthak
  2020-08-15 21:41 ` Nathan Sidwell
@ 2020-08-17 14:31 ` Allan Sandfeld Jensen
  2020-08-17 17:07   ` Stefan Kanthak
  1 sibling, 1 reply; 13+ messages in thread
From: Allan Sandfeld Jensen @ 2020-08-17 14:31 UTC (permalink / raw)
  To: gcc; +Cc: Stefan Kanthak

On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> Hi @ll,
> 
> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
> Matt Godbolt used the function
> 
> | bool isWhitespace(char c)
> | {
> | 
> |     return c == ' '
> |     
> |       || c == '\r'
> |       || c == '\n'
> |       || c == '\t';
> | 
> | }
> 
> as an example, for which GCC 9.1 emits the following assembly for AMD64
> 
> processors (see <https://godbolt.org/z/acm19_conds>):
> |    xor    eax, eax              ; result = false
> |    cmp    dil, 32               ; is c > 32
> |    ja     .L4                   ; if so, exit with false
> |    movabs rax, 4294977024       ; rax = 0x100002600
> |    shrx   rax, rax, rdi         ; rax >>= c
> |    and    eax, 1                ; result = rax & 1
> |
> |.L4:
> |    ret
> 
No it doesn't. As your example shows if you took the time to read it, it is 
what gcc emit when generating code to run on a _haswell_ architecture. If you 
remove -march=haswell from the command line you get:

        xor     eax, eax
        cmp     dil, 32
        ja      .L1
        movabs  rax, 4294977024
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1

It uses one mov more, but no shrx. 

'Allan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-17 14:31 ` Allan Sandfeld Jensen
@ 2020-08-17 17:07   ` Stefan Kanthak
  2020-08-24  7:48     ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-17 17:07 UTC (permalink / raw)
  To: Allan Sandfeld Jensen, gcc

"Allan Sandfeld Jensen" <linux@carewolf.com> wrote:

> On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> Hi @ll,
>> 
>> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> Matt Godbolt used the function
>> 
>> | bool isWhitespace(char c)
>> | {
>> | 
>> |     return c == ' '
>> |     
>> |       || c == '\r'
>> |       || c == '\n'
>> |       || c == '\t';
>> | 
>> | }
>> 
>> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> 
>> processors (see <https://godbolt.org/z/acm19_conds>):
>> |    xor    eax, eax              ; result = false
>> |    cmp    dil, 32               ; is c > 32
>> |    ja     .L4                   ; if so, exit with false
>> |    movabs rax, 4294977024       ; rax = 0x100002600
>> |    shrx   rax, rax, rdi         ; rax >>= c
>> |    and    eax, 1                ; result = rax & 1
>> |
>> |.L4:
>> |    ret
>> 
> No it doesn't. As your example shows if you took the time to read it, it is 
> what gcc emit when generating code to run on a _haswell_ architecture.

Matt's article does NOT specify the architecture for THIS example.
He specified it for another example he named "(q)":

| When targeting the Haswell microarchitecture, GCC 8.2 compiles this code
| to the assembly in (q) (https://godbolt.org/z/acm19_bits):

WHat about CAREFUL reading?

> If you remove -march=haswell from the command line you get:
> 
>        xor     eax, eax
>        cmp     dil, 32
>        ja      .L1
>        movabs  rax, 4294977024
>        mov     ecx, edi
>        shr     rax, cl
>        and     eax, 1
> 
> It uses one mov more, but no shrx. 

The SHRX is NOT the point here; its the avoidable conditional branch that
matters!

         mov     ecx, edi
         movabs  rax, 4294977024
         shr     rax, cl
         xor     edi, edi
         cmp     ecx, 33
         setb    dil
         and     eax, edi

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-17 17:07   ` Stefan Kanthak
@ 2020-08-24  7:48     ` Richard Biener
  2020-08-24  8:11       ` Alexander Monakov
  2020-08-24 11:19       ` Stefan Kanthak
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Biener @ 2020-08-24  7:48 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Allan Sandfeld Jensen, GCC Development

On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> "Allan Sandfeld Jensen" <linux@carewolf.com> wrote:
>
> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> Hi @ll,
> >>
> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
> >> Matt Godbolt used the function
> >>
> >> | bool isWhitespace(char c)
> >> | {
> >> |
> >> |     return c == ' '
> >> |
> >> |       || c == '\r'
> >> |       || c == '\n'
> >> |       || c == '\t';
> >> |
> >> | }
> >>
> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >>
> >> processors (see <https://godbolt.org/z/acm19_conds>):
> >> |    xor    eax, eax              ; result = false
> >> |    cmp    dil, 32               ; is c > 32
> >> |    ja     .L4                   ; if so, exit with false
> >> |    movabs rax, 4294977024       ; rax = 0x100002600
> >> |    shrx   rax, rax, rdi         ; rax >>= c
> >> |    and    eax, 1                ; result = rax & 1
> >> |
> >> |.L4:
> >> |    ret
> >>
> > No it doesn't. As your example shows if you took the time to read it, it is
> > what gcc emit when generating code to run on a _haswell_ architecture.
>
> Matt's article does NOT specify the architecture for THIS example.
> He specified it for another example he named "(q)":
>
> | When targeting the Haswell microarchitecture, GCC 8.2 compiles this code
> | to the assembly in (q) (https://godbolt.org/z/acm19_bits):
>
> WHat about CAREFUL reading?
>
> > If you remove -march=haswell from the command line you get:
> >
> >        xor     eax, eax
> >        cmp     dil, 32
> >        ja      .L1
> >        movabs  rax, 4294977024
> >        mov     ecx, edi
> >        shr     rax, cl
> >        and     eax, 1
> >
> > It uses one mov more, but no shrx.
>
> The SHRX is NOT the point here; its the avoidable conditional branch that
> matters!

Whether or not the conditional branch sequence is faster depends on whether
the branch is well-predicted which very much depends on the data you
feed the isWhitespace function with but I guess since this is the
c == ' ' test it _will_ be a well-predicted branch which means the
conditional branch sequence will be usually faster.  The proposed
change turns the control into a data dependence which constrains
instruction scheduling and retirement.  Indeed a mispredicted branch
will likely be more costly.

x86 CPUs do not perform data speculation.

Richard.

>          mov     ecx, edi
>          movabs  rax, 4294977024
>          shr     rax, cl
>          xor     edi, edi
>          cmp     ecx, 33
>          setb    dil
>          and     eax, edi
>
> Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-24  7:48     ` Richard Biener
@ 2020-08-24  8:11       ` Alexander Monakov
  2020-08-24 11:19       ` Stefan Kanthak
  1 sibling, 0 replies; 13+ messages in thread
From: Alexander Monakov @ 2020-08-24  8:11 UTC (permalink / raw)
  To: Richard Biener; +Cc: Stefan Kanthak, GCC Development

On Mon, 24 Aug 2020, Richard Biener via Gcc wrote:

> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with but I guess since this is the
> c == ' ' test it _will_ be a well-predicted branch which means the
> conditional branch sequence will be usually faster.  The proposed
> change turns the control into a data dependence which constrains
> instruction scheduling and retirement.  Indeed a mispredicted branch
> will likely be more costly.

There's also the question how the caller is using the return value. In all
likelihood, the caller branches on it, so making isWhitespace branchless
just moves the misprediction cost to the caller.

On x86, we should be aiming to produce the BT instruction. GIMPLE reassoc
nicely transforms multiple branches into a bit test, but unfortunately it
uses right shift, while RTL matches for a left shift, but not right..
With hand-written code it's easy to make GCC produce BT as desired:

void is_ws_cb(unsigned char c, void f(void))
{
        unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
        if (c <= 32 && (mask & (1ll<<c)))
            f();
}

        cmpb    $32, %dil
        ja      .L5
        movabsq $4294977024, %rax
        btq     %rdi, %rax
        jnc     .L5
        jmp     *%rsi
.L5:
        ret

In PR 96633 I also outline how an efficient branchless code could look like.

Alexander

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-24  7:48     ` Richard Biener
  2020-08-24  8:11       ` Alexander Monakov
@ 2020-08-24 11:19       ` Stefan Kanthak
  2020-08-24 12:30         ` Richard Biener
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-24 11:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: Allan Sandfeld Jensen, GCC Development

"Richard Biener" <richard.guenther@gmail.com> wrote:

> On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> "Allan Sandfeld Jensen" <linux@carewolf.com> wrote:
>>
>> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
>> >> Hi @ll,
>> >>
>> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
>> >> Matt Godbolt used the function
>> >>
>> >> | bool isWhitespace(char c)
>> >> | {
>> >> |
>> >> |     return c == ' '
>> >> |
>> >> |       || c == '\r'
>> >> |       || c == '\n'
>> >> |       || c == '\t';
>> >> |
>> >> | }
>> >>
>> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
>> >>
>> >> processors (see <https://godbolt.org/z/acm19_conds>):
>> >> |    xor    eax, eax              ; result = false
>> >> |    cmp    dil, 32               ; is c > 32
>> >> |    ja     .L4                   ; if so, exit with false
>> >> |    movabs rax, 4294977024       ; rax = 0x100002600
>> >> |    shrx   rax, rax, rdi         ; rax >>= c
>> >> |    and    eax, 1                ; result = rax & 1
>> >> |
>> >> |.L4:
>> >> |    ret

[...]
 
> Whether or not the conditional branch sequence is faster depends on whether
> the branch is well-predicted which very much depends on the data you
> feed the isWhitespace function with

Correct.

> but I guess since this is the c == ' ' test it _will_ be a well-predicted branch

Also correct, but you miss a point: the typical use case is

    while (isWhitespace(*ptr)) ptr++;

> which means the conditional branch sequence will be usually faster.

And this is wrong!
The (well-predicted) branch is usually NOT taken, so both code variants
usually execute (with one exception the same) 6 or 7 instructions.

> The proposed change turns the control into a data dependence which
> constrains instruction scheduling and retirement.

It doesn't matter: the branch has the same data dependency too!

> Indeed a mispredicted branch will likely be more costly.

And no branch is even better: the branch predictor has a limited capacity,
so every removed branch instruction can help improve its efficiency.

> x86 CPUs do not perform data speculation.

>>          mov     ecx, edi
>>          movabs  rax, 4294977024
>>          shr     rax, cl
>>          xor     edi, edi
>>          cmp     ecx, 33
>>          setb    dil
>>          and     eax, edi

I already presented measured numbers: with random data, the branch-free
code is faster, with ordered data the original code.

Left column 1 billion sequential characters
    for (int i=1000000000; i; --i) ...(i);
right column 1 billion random characters, in cycles per character:

GCC:           2.4    3.4
branch-free:   3.0    2.5

Now perform a linear interpolation and find the break-even point at
p=0.4, with p=0 for ordered data and p=1 for random data, or just use
the average of these numbers: 2.9 cycles vs. 2.75 cycles.
That's small, but measurable!

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-24 11:19       ` Stefan Kanthak
@ 2020-08-24 12:30         ` Richard Biener
  2020-08-24 13:49           ` Stefan Kanthak
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2020-08-24 12:30 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Allan Sandfeld Jensen, GCC Development

On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> "Richard Biener" <richard.guenther@gmail.com> wrote:
>
> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
> >>
> >> "Allan Sandfeld Jensen" <linux@carewolf.com> wrote:
> >>
> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:
> >> >> Hi @ll,
> >> >>
> >> >> in his ACM queue article <https://queue.acm.org/detail.cfm?id=3372264>,
> >> >> Matt Godbolt used the function
> >> >>
> >> >> | bool isWhitespace(char c)
> >> >> | {
> >> >> |
> >> >> |     return c == ' '
> >> >> |
> >> >> |       || c == '\r'
> >> >> |       || c == '\n'
> >> >> |       || c == '\t';
> >> >> |
> >> >> | }
> >> >>
> >> >> as an example, for which GCC 9.1 emits the following assembly for AMD64
> >> >>
> >> >> processors (see <https://godbolt.org/z/acm19_conds>):
> >> >> |    xor    eax, eax              ; result = false
> >> >> |    cmp    dil, 32               ; is c > 32
> >> >> |    ja     .L4                   ; if so, exit with false
> >> >> |    movabs rax, 4294977024       ; rax = 0x100002600
> >> >> |    shrx   rax, rax, rdi         ; rax >>= c
> >> >> |    and    eax, 1                ; result = rax & 1
> >> >> |
> >> >> |.L4:
> >> >> |    ret
>
> [...]
>
> > Whether or not the conditional branch sequence is faster depends on whether
> > the branch is well-predicted which very much depends on the data you
> > feed the isWhitespace function with
>
> Correct.
>
> > but I guess since this is the c == ' ' test it _will_ be a well-predicted branch
>
> Also correct, but you miss a point: the typical use case is
>
>     while (isWhitespace(*ptr)) ptr++;
>
> > which means the conditional branch sequence will be usually faster.
>
> And this is wrong!
> The (well-predicted) branch is usually NOT taken, so both code variants
> usually execute (with one exception the same) 6 or 7 instructions.

Whether or not the branch is predicted taken does not matter, what
matters is that the continuation is not data dependent on the branch
target computation and thus can execute in parallel to it.

> > The proposed change turns the control into a data dependence which
> > constrains instruction scheduling and retirement.
>
> It doesn't matter: the branch has the same data dependency too!
>
> > Indeed a mispredicted branch will likely be more costly.
>
> And no branch is even better: the branch predictor has a limited capacity,
> so every removed branch instruction can help improve its efficiency.
>
> > x86 CPUs do not perform data speculation.
>
> >>          mov     ecx, edi
> >>          movabs  rax, 4294977024
> >>          shr     rax, cl
> >>          xor     edi, edi
> >>          cmp     ecx, 33
> >>          setb    dil
> >>          and     eax, edi
>
> I already presented measured numbers: with random data, the branch-free
> code is faster, with ordered data the original code.
>
> Left column 1 billion sequential characters
>     for (int i=1000000000; i; --i) ...(i);
> right column 1 billion random characters, in cycles per character:

I guess feeding it Real Text (TM) is the only relevant benchmark,
doing sth like

  for (;;)
     cnt[isWhitespace(*ptr++)]++;

> GCC:           2.4    3.4
> branch-free:   3.0    2.5

I'd call that unconclusive data - you also failed to show your test data
is somehow relevant.  We do know that mispredicted branches are bad.
You show well-predicted branches are good.  By simple statistics
singling out 4 out of 255 values will make the branches well-predicted.

> Now perform a linear interpolation and find the break-even point at
> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
> That's small, but measurable!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
  2020-08-24 12:30         ` Richard Biener
@ 2020-08-24 13:49           ` Stefan Kanthak
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-24 13:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: Allan Sandfeld Jensen, GCC Development

"Richard Biener" <richard.guenther@gmail.com> wrote:


> On Mon, Aug 24, 2020 at 1:22 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> "Richard Biener" <richard.guenther@gmail.com> wrote:
>>
>> > On Mon, Aug 17, 2020 at 7:09 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>> >>
>> >> "Allan Sandfeld Jensen" <linux@carewolf.com> wrote:
>> >>
>> >> > On Freitag, 14. August 2020 18:43:12 CEST Stefan Kanthak wrote:

[...]

> Whether or not the branch is predicted taken does not matter, what
> matters is that the continuation is not data dependent on the branch
> target computation and thus can execute in parallel to it.

My benchmark shows that this doesn't matter!

>> > The proposed change turns the control into a data dependence which
>> > constrains instruction scheduling and retirement.
>>
>> It doesn't matter: the branch has the same data dependency too!
>>
>> > Indeed a mispredicted branch will likely be more costly.
>>
>> And no branch is even better: the branch predictor has a limited capacity,
>> so every removed branch instruction can help improve its efficiency.
>>
>> > x86 CPUs do not perform data speculation.
>>
>> >>          mov     ecx, edi
>> >>          movabs  rax, 4294977024
>> >>          shr     rax, cl
>> >>          xor     edi, edi
>> >>          cmp     ecx, 33
>> >>          setb    dil
>> >>          and     eax, edi
>>
>> I already presented measured numbers: with random data, the branch-free
>> code is faster, with ordered data the original code.
>>
>> Left column 1 billion sequential characters
>>     for (int i=1000000000; i; --i) ...(i);
>> right column 1 billion random characters, in cycles per character:
> 
> I guess feeding it Real Text (TM) is the only relevant benchmark,
> doing sth like
> 
>  for (;;)
>     cnt[isWhitespace(*ptr++)]++;

I approximated that using a PRNG...

>> GCC:           2.4    3.4
>> branch-free:   3.0    2.5
> 
> I'd call that unconclusive data - you also failed to show your test data
> is somehow relevant.

Since nobody can predict real world data all test data are irrelevant,
somehow. I thus call your argument a NULL argument.

> We do know that mispredicted branches are bad.
> You show well-predicted branches are good.

Wrong: I show that no branches are still better.

> By simple statistics singling out 4 out of 255 values will make the
> branches well-predicted.

Your statistic is wrong:
1. the branch singles out 224 of 256 values, i.e. 7/8 of all data;
2. whitespace lies in the 1/8 which is not singled out.

>> Now perform a linear interpolation and find the break-even point at
>> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
>> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
>> That's small, but measurable!

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Peephole optimisation: isWhitespace()
@ 2020-08-25 21:00 Stefan Kanthak
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Kanthak @ 2020-08-25 21:00 UTC (permalink / raw)
  To: Richard Biener; +Cc: Allan Sandfeld Jensen, GCC Development

I wrote:

> "Richard Biener" <richard.guenther@gmail.com> wrote:

[...]
 
>> Whether or not the branch is predicted taken does not matter, what
>> matters is that the continuation is not data dependent on the branch
>> target computation and thus can execute in parallel to it.
> 
> My benchmark shows that this doesn't matter!

>>> I already presented measured numbers: with random data, the branch-free
>>> code is faster, with ordered data the original code.

[...]

>> We do know that mispredicted branches are bad.
>> You show well-predicted branches are good.
> 
> Wrong: I show that no branches are still better.
> 
>> By simple statistics singling out 4 out of 255 values will make the
>> branches well-predicted.
> 
> Your statistic is wrong:
> 1. the branch singles out 224 of 256 values, i.e. 7/8 of all data;
> 2. whitespace lies in the 1/8 which is not singled out.
> 
>>> Now perform a linear interpolation and find the break-even point at
>>> p=0.4, with p=0 for ordered data and p=1 for random data, or just use
>>> the average of these numbers: 2.9 cycles vs. 2.75 cycles.
>>> That's small, but measurable!

I recommend you ALL start to read Daniel Lemire's article
<https://www.infoq.com/articles/making-code-faster-taming-branches/>

| Even when it is impossible to remove all branches, reducing the number
| of branches "almost always taken" or "almost never taken" may help the
| processor better predict the remaining branches.

JFTR: I didn't know his article before, but I hope that you are willing
      to learn.

Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-08-25 21:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-14 16:43 Peephole optimisation: isWhitespace() Stefan Kanthak
2020-08-15 21:41 ` Nathan Sidwell
2020-08-16 13:54   ` Stefan Kanthak
2020-08-16 20:18     ` Nathan Sidwell
2020-08-17 13:28       ` Stefan Kanthak
2020-08-17 14:31 ` Allan Sandfeld Jensen
2020-08-17 17:07   ` Stefan Kanthak
2020-08-24  7:48     ` Richard Biener
2020-08-24  8:11       ` Alexander Monakov
2020-08-24 11:19       ` Stefan Kanthak
2020-08-24 12:30         ` Richard Biener
2020-08-24 13:49           ` Stefan Kanthak
2020-08-25 21:00 Stefan Kanthak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).