public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Who cares about performance (or Intel's CPU errata)?
@ 2023-05-27 21:20 Stefan Kanthak
  2023-05-27 21:47 ` Andrew Pinski
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-27 21:20 UTC (permalink / raw)
  To: gcc

Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:

--- dontcare.c ---
int ispowerof2(unsigned __int128 argument) {
    return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
}
--- EOF ---

GCC 13.3    gcc -march=haswell -O3

https://gcc.godbolt.org/z/PPzYsPzMc
ispowerof2(unsigned __int128):
        popcnt  rdi, rdi
        popcnt  rsi, rsi
        add     esi, edi
        xor     eax, eax
        cmp     esi, 1
        sete    al
        ret

OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?

See https://gcc.godbolt.org/z/jdjTc3EET for comparison!

FIX YOUR BUGS, KIDS!

Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak
@ 2023-05-27 21:47 ` Andrew Pinski
  2023-05-27 22:52   ` Stefan Kanthak
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Pinski @ 2023-05-27 21:47 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>
> --- dontcare.c ---
> int ispowerof2(unsigned __int128 argument) {
>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -march=haswell -O3
>
> https://gcc.godbolt.org/z/PPzYsPzMc
> ispowerof2(unsigned __int128):
>         popcnt  rdi, rdi
>         popcnt  rsi, rsi
>         add     esi, edi
>         xor     eax, eax
>         cmp     esi, 1
>         sete    al
>         ret
>
> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?

Because the popcount is going to the same register, there is no false
dependency ....
The false dependency errata only applies if the result of the popcnt
is going to a different register, the processor thinks it depends on
the result in that register from a previous instruction but it does
not (which is why it is called a false dependency). In this case it
actually does depend on the previous result since the input is the
same as the input.

Thanks,
Andrew

>
> See https://gcc.godbolt.org/z/jdjTc3EET for comparison!
>
> FIX YOUR BUGS, KIDS!
>
> Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 21:47 ` Andrew Pinski
@ 2023-05-27 22:52   ` Stefan Kanthak
  2023-05-27 23:18     ` Mark Wielaard
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-27 22:52 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc

"Andrew Pinski" <pinskia@gmail.com> wrote:

> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>
>> --- dontcare.c ---
>> int ispowerof2(unsigned __int128 argument) {
>>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -march=haswell -O3
>>
>> https://gcc.godbolt.org/z/PPzYsPzMc
>> ispowerof2(unsigned __int128):
>>         popcnt  rdi, rdi
>>         popcnt  rsi, rsi
>>         add     esi, edi
>>         xor     eax, eax
>>         cmp     esi, 1
>>         sete    al
>>         ret
>>
>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>
> Because the popcount is going to the same register, there is no false
> dependency ....
> The false dependency errata only applies if the result of the popcnt
> is going to a different register, the processor thinks it depends on
> the result in that register from a previous instruction but it does
> not (which is why it is called a false dependency). In this case it
> actually does depend on the previous result since the input is the
> same as the input.

OUCH, my fault; sorry for the confusion and the wrong accusation.

Nevertheless GCC fails to optimise code properly:

--- .c ---
int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}
--- EOF ---

GCC 13.3    gcc -m32 -mpopcnt -O3

https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
        xor     eax, eax
        xor     edx, edx
        popcnt  eax, [esp+4]
        popcnt  edx, [esp+8]
        add     eax, edx                 # eax is less than 64!
        cmp     eax, 1    ->    dec eax  # 2 bytes shorter
        sete    al
        movzx   eax, al                  # superfluous
        ret

5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.

JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument

JFTR: GCC is notorious for generating superfluous MOVZX instructions
      where its optimiser SHOULD be able see that the value is already
      less than 256!

Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 22:52   ` Stefan Kanthak
@ 2023-05-27 23:18     ` Mark Wielaard
  2023-05-27 23:30     ` Andrew Pinski
  2023-05-28  6:40     ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
  2 siblings, 0 replies; 9+ messages in thread
From: Mark Wielaard @ 2023-05-27 23:18 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Andrew Pinski, gcc

Hi Stefan,

On Sun, May 28, 2023 at 12:52:30AM +0200, Stefan Kanthak wrote:
> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:

This is totally not constructive. You should stop attacking others
like that on this list.

> OUCH, my fault; sorry for the confusion and the wrong accusation.

Thanks for apologizing.

There is a system for reporting bugs. As others have requested, please
use that: https://gcc.gnu.org/bugs/#where

Please stop ranting on this list and report bugs in bugzilla. Or you
might get banned from the list.

Thanks,

Mark

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 22:52   ` Stefan Kanthak
  2023-05-27 23:18     ` Mark Wielaard
@ 2023-05-27 23:30     ` Andrew Pinski
  2023-05-28  7:47       ` Stefan Kanthak
                         ` (2 more replies)
  2023-05-28  6:40     ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
  2 siblings, 3 replies; 9+ messages in thread
From: Andrew Pinski @ 2023-05-27 23:30 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: gcc

On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> "Andrew Pinski" <pinskia@gmail.com> wrote:
>
> > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
> >>
> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
> >>
> >> --- dontcare.c ---
> >> int ispowerof2(unsigned __int128 argument) {
> >>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> >> }
> >> --- EOF ---
> >>
> >> GCC 13.3    gcc -march=haswell -O3
> >>
> >> https://gcc.godbolt.org/z/PPzYsPzMc
> >> ispowerof2(unsigned __int128):
> >>         popcnt  rdi, rdi
> >>         popcnt  rsi, rsi
> >>         add     esi, edi
> >>         xor     eax, eax
> >>         cmp     esi, 1
> >>         sete    al
> >>         ret
> >>
> >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
> >
> > Because the popcount is going to the same register, there is no false
> > dependency ....
> > The false dependency errata only applies if the result of the popcnt
> > is going to a different register, the processor thinks it depends on
> > the result in that register from a previous instruction but it does
> > not (which is why it is called a false dependency). In this case it
> > actually does depend on the previous result since the input is the
> > same as the input.
>
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
>     return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
>         xor     eax, eax
>         xor     edx, edx
>         popcnt  eax, [esp+4]
>         popcnt  edx, [esp+8]
>         add     eax, edx                 # eax is less than 64!
>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter

dec eax is done for -Os already. -O2 means performance, it does not
mean decrease size. dec can be slower as it can create a false
dependency and it requires eax register to be not alive at the end of
the statement. and IIRC for x86 decode, it could cause 2 (not 1)
micro-ops.

>         sete    al
>         movzx   eax, al                  # superfluous

No it is not superfluous, well ok it is because of the context of eax
(besides the lower 8 bits) are already zero'd but keeping that track
is a hard problem and is turning problem really. And I suspect it
would cause another false dependency later on too.

For -Os -march=skylake (and -Oz instead of -Os) we get:
        popcnt  rdi, rdi
        popcnt  rsi, rsi
        add     esi, edi
        xor     eax, eax
        dec     esi
        sete    al

Which is exactly what you want right?

Thanks,
Andrew

>         ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
>       where its optimiser SHOULD be able see that the value is already
>       less than 256!
>
> Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 22:52   ` Stefan Kanthak
  2023-05-27 23:18     ` Mark Wielaard
  2023-05-27 23:30     ` Andrew Pinski
@ 2023-05-28  6:40     ` Nicholas Vinson
  2 siblings, 0 replies; 9+ messages in thread
From: Nicholas Vinson @ 2023-05-28  6:40 UTC (permalink / raw)
  To: gcc


On 5/27/23 18:52, Stefan Kanthak wrote:
> "Andrew Pinski" <pinskia@gmail.com> wrote:
>
>> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>>
>>> --- dontcare.c ---
>>> int ispowerof2(unsigned __int128 argument) {
>>>      return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>>> }
>>> --- EOF ---
>>>
>>> GCC 13.3    gcc -march=haswell -O3
>>>
>>> https://gcc.godbolt.org/z/PPzYsPzMc
>>> ispowerof2(unsigned __int128):
>>>          popcnt  rdi, rdi
>>>          popcnt  rsi, rsi
>>>          add     esi, edi
>>>          xor     eax, eax
>>>          cmp     esi, 1
>>>          sete    al
>>>          ret
>>>
>>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>> Because the popcount is going to the same register, there is no false
>> dependency ....
>> The false dependency errata only applies if the result of the popcnt
>> is going to a different register, the processor thinks it depends on
>> the result in that register from a previous instruction but it does
>> not (which is why it is called a false dependency). In this case it
>> actually does depend on the previous result since the input is the
>> same as the input.
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
>      return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
>          xor     eax, eax
>          xor     edx, edx
>          popcnt  eax, [esp+4]
>          popcnt  edx, [esp+8]
>          add     eax, edx                 # eax is less than 64!
Less than or equal to 64 (consider the case when input is (unsigned long 
long)-1)
>          cmp     eax, 1    ->    dec eax  # 2 bytes shorter
>          sete    al
>          movzx   eax, al                  # superfluous
Not when dec is used. Use dec and omit this instruction, you may get a 
result value of 0xffffff00 (consider the case when input is (unsigned 
long long)0).
>          ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
>        where its optimiser SHOULD be able see that the value is already
>        less than 256!
>
> Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 23:30     ` Andrew Pinski
@ 2023-05-28  7:47       ` Stefan Kanthak
  2023-05-28 12:29       ` David Brown
  2023-05-29 18:11       ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
  2 siblings, 0 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-28  7:47 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc

"Andrew Pinski" <pinskia@gmail.com> wrote:

> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:

[...]

>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>>     return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>>         xor     eax, eax
>>         xor     edx, edx
>>         popcnt  eax, [esp+4]
>>         popcnt  edx, [esp+8]
>>         add     eax, edx                 # eax is less than 64!
>>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
>
> dec eax is done for -Os already. -O2 means performance, it does not
> mean decrease size. dec can be slower as it can create a false
> dependency and it requires eax register to be not alive at the end of
> the statement. and IIRC for x86 decode, it could cause 2 (not 1)
> micro-ops.

It CAN, it COULD, but is does NOT NEED to: it all depends on the target
processor. Shall I add an example with -march=<not affected processor>?

>>         sete    al

Depending on the target processor the partial register can also harm
the performance.
Did you forget to mention that too?

>>         movzx   eax, al                  # superfluous
>
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd

Correct.
The same holds for example for PMOVMSKB when the high(er) lane(s) of
the source [XYZ]MM register are (known to be) 0, for example after MOVQ;
that's what GCC also fails to track.

> but keeping that track is a hard problem and is turning problem really.

Aren't such problems just there to be solved?

> And I suspect it would cause another false dependency later on too.

All these quirks can be avoided with the following 6-byte code sequence
(same size as SETcc plus MOVZX) I used in one of my previous posts to
fold any non-zero value to 1:

        neg    eax
        sbb    eax, eax
        neg    eax

No partial register writes, no false dependencies, no INC/DEC subleties.

JFTR: AMD documents that SBB with same destination and source is handled
      in the register renamer; I suspect Intel processors do it too,
      albeit not documented.

> For -Os -march=skylake (and -Oz instead of -Os) we get:
>        popcnt  rdi, rdi
>        popcnt  rsi, rsi
>        add     esi, edi
>        xor     eax, eax
>        dec     esi
>        sete    al
>
> Which is exactly what you want right?

Yes.
For -m32 -Os/-Oz, AND if CDQ breaks the dependency, it should be

         xor     eax, eax
         xor     edx, edx  ->    cdq      # 1 byte shorter
         popcnt  eax, [esp+4]
         popcnt  edx, [esp+8]
         add     eax, edx                 # eax is less than 64!
         cmp     eax, 1    ->    dec eax  # 2 bytes shorter

On AMD64 DEC <r32> is a 2-byte instruction; the following alternative code
avoids its potential false dependency as well as other possible quirks,
and also suits -Ot, -O2 and -O3 on processors where the register renamer
handles the XOR:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        xor     eax, eax
        not     edi        # edi = -(edi + 1)
        sub     edi, esi   # edi = -(edi + 1 + esi)
        setz    al

For processors where the register renamer doesn't "execute" XOR, but MOV,
the following code is an alternative for -Ot, -O2 and -O3:

        popcnt  rdi, rdi
        popcnt  rsi, rsi
        mov     eax, edi
        add     eax, esi
        cmp     eax, 1
        setz    al

Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Who cares about performance (or Intel's CPU errata)?
  2023-05-27 23:30     ` Andrew Pinski
  2023-05-28  7:47       ` Stefan Kanthak
@ 2023-05-28 12:29       ` David Brown
  2023-05-29 18:11       ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
  2 siblings, 0 replies; 9+ messages in thread
From: David Brown @ 2023-05-28 12:29 UTC (permalink / raw)
  To: gcc

On 28/05/2023 01:30, Andrew Pinski via Gcc wrote:
> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>

>>          sete    al
>>          movzx   eax, al                  # superfluous
> 
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd but keeping that track
> is a hard problem and is turning problem really. And I suspect it
> would cause another false dependency later on too.
> 
> For -Os -march=skylake (and -Oz instead of -Os) we get:
>          popcnt  rdi, rdi
>          popcnt  rsi, rsi
>          add     esi, edi
>          xor     eax, eax
>          dec     esi
>          sete    al
> 
> Which is exactly what you want right?
> 
> Thanks,
> Andrew
> 
There is also the option of using "bool" as the return type for boolean 
functions, rather than "int".  When returning a "bool", gcc does not add 
the "movzx eax, al" instruction.  (There are some circumstances where 
returning "int" for a boolean value is a better choice, but usually 
"bool" makes more sense, and it can often be a touch more efficient.)

David



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?)
  2023-05-27 23:30     ` Andrew Pinski
  2023-05-28  7:47       ` Stefan Kanthak
  2023-05-28 12:29       ` David Brown
@ 2023-05-29 18:11       ` Stefan Kanthak
  2 siblings, 0 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-29 18:11 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc

"Andrew Pinski" <pinskia@gmail.com> wrote:

> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:

>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>>     return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3    gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>>         xor     eax, eax
>>         xor     edx, edx
>>         popcnt  eax, [esp+4]
>>         popcnt  edx, [esp+8]
>>         add     eax, edx                 # eax is less than 64!
>>         cmp     eax, 1    ->    dec eax  # 2 bytes shorter
           sete    al
           movzx   eax, al
           ret
>
> dec eax is done for -Os already.
> -O2 means performance, it does not mean decrease size.

But -Os is supposed to optimise for size? REALITY CHECK:

GCC 13.3    gcc -m32 -mpopcnt -Os

https://godbolt.org/z/41Ed6rr6r
ispowerof2(unsigned long long):
        push    ebp
        mov     ebp, esp
        sub     esp, 16
        push    [ebp+12]
        push    [ebp+8]
        call    __popcountdi2
        add     esp, 16        # superfluous!
        leave
        dec     eax
        sete    al
        movzx   eax, al
        ret

While -O3 generates 9 instructions in a total of 24 bytes,
-Os[lowmotion] but generates 12 instructions in 29 bytes,
PLUS the uncounted instructions/bytes of __popcountdi2()!

This is what I call an EPIC FAILURE!

Stefan

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-05-29 18:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak
2023-05-27 21:47 ` Andrew Pinski
2023-05-27 22:52   ` Stefan Kanthak
2023-05-27 23:18     ` Mark Wielaard
2023-05-27 23:30     ` Andrew Pinski
2023-05-28  7:47       ` Stefan Kanthak
2023-05-28 12:29       ` David Brown
2023-05-29 18:11       ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
2023-05-28  6:40     ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).