* Who cares about performance (or Intel's CPU errata)?
@ 2023-05-27 21:20 Stefan Kanthak
2023-05-27 21:47 ` Andrew Pinski
0 siblings, 1 reply; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-27 21:20 UTC (permalink / raw)
To: gcc
Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
--- dontcare.c ---
int ispowerof2(unsigned __int128 argument) {
return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
}
--- EOF ---
GCC 13.3 gcc -march=haswell -O3
https://gcc.godbolt.org/z/PPzYsPzMc
ispowerof2(unsigned __int128):
popcnt rdi, rdi
popcnt rsi, rsi
add esi, edi
xor eax, eax
cmp esi, 1
sete al
ret
OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
See https://gcc.godbolt.org/z/jdjTc3EET for comparison!
FIX YOUR BUGS, KIDS!
Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak
@ 2023-05-27 21:47 ` Andrew Pinski
2023-05-27 22:52 ` Stefan Kanthak
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Pinski @ 2023-05-27 21:47 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>
> --- dontcare.c ---
> int ispowerof2(unsigned __int128 argument) {
> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> }
> --- EOF ---
>
> GCC 13.3 gcc -march=haswell -O3
>
> https://gcc.godbolt.org/z/PPzYsPzMc
> ispowerof2(unsigned __int128):
> popcnt rdi, rdi
> popcnt rsi, rsi
> add esi, edi
> xor eax, eax
> cmp esi, 1
> sete al
> ret
>
> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
Because the popcount is going to the same register, there is no false
dependency ....
The false dependency errata only applies if the result of the popcnt
is going to a different register, the processor thinks it depends on
the result in that register from a previous instruction but it does
not (which is why it is called a false dependency). In this case it
actually does depend on the previous result since the input is the
same as the input.
Thanks,
Andrew
>
> See https://gcc.godbolt.org/z/jdjTc3EET for comparison!
>
> FIX YOUR BUGS, KIDS!
>
> Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 21:47 ` Andrew Pinski
@ 2023-05-27 22:52 ` Stefan Kanthak
2023-05-27 23:18 ` Mark Wielaard
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-27 22:52 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc
"Andrew Pinski" <pinskia@gmail.com> wrote:
> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>
>> --- dontcare.c ---
>> int ispowerof2(unsigned __int128 argument) {
>> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3 gcc -march=haswell -O3
>>
>> https://gcc.godbolt.org/z/PPzYsPzMc
>> ispowerof2(unsigned __int128):
>> popcnt rdi, rdi
>> popcnt rsi, rsi
>> add esi, edi
>> xor eax, eax
>> cmp esi, 1
>> sete al
>> ret
>>
>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>
> Because the popcount is going to the same register, there is no false
> dependency ....
> The false dependency errata only applies if the result of the popcnt
> is going to a different register, the processor thinks it depends on
> the result in that register from a previous instruction but it does
> not (which is why it is called a false dependency). In this case it
> actually does depend on the previous result since the input is the
> same as the input.
OUCH, my fault; sorry for the confusion and the wrong accusation.
Nevertheless GCC fails to optimise code properly:
--- .c ---
int ispowerof2(unsigned long long argument) {
return __builtin_popcountll(argument) == 1;
}
--- EOF ---
GCC 13.3 gcc -m32 -mpopcnt -O3
https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
xor eax, eax
xor edx, edx
popcnt eax, [esp+4]
popcnt edx, [esp+8]
add eax, edx # eax is less than 64!
cmp eax, 1 -> dec eax # 2 bytes shorter
sete al
movzx eax, al # superfluous
ret
5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.
JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
JFTR: GCC is notorious for generating superfluous MOVZX instructions
where its optimiser SHOULD be able see that the value is already
less than 256!
Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 22:52 ` Stefan Kanthak
@ 2023-05-27 23:18 ` Mark Wielaard
2023-05-27 23:30 ` Andrew Pinski
2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
2 siblings, 0 replies; 9+ messages in thread
From: Mark Wielaard @ 2023-05-27 23:18 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Andrew Pinski, gcc
Hi Stefan,
On Sun, May 28, 2023 at 12:52:30AM +0200, Stefan Kanthak wrote:
> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
This is totally not constructive. You should stop attacking others
like that on this list.
> OUCH, my fault; sorry for the confusion and the wrong accusation.
Thanks for apologizing.
There is a system for reporting bugs. As others have requested, please
use that: https://gcc.gnu.org/bugs/#where
Please stop ranting on this list and report bugs in bugzilla. Or you
might get banned from the list.
Thanks,
Mark
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 22:52 ` Stefan Kanthak
2023-05-27 23:18 ` Mark Wielaard
@ 2023-05-27 23:30 ` Andrew Pinski
2023-05-28 7:47 ` Stefan Kanthak
` (2 more replies)
2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
2 siblings, 3 replies; 9+ messages in thread
From: Andrew Pinski @ 2023-05-27 23:30 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: gcc
On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>
> "Andrew Pinski" <pinskia@gmail.com> wrote:
>
> > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
> >>
> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
> >>
> >> --- dontcare.c ---
> >> int ispowerof2(unsigned __int128 argument) {
> >> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
> >> }
> >> --- EOF ---
> >>
> >> GCC 13.3 gcc -march=haswell -O3
> >>
> >> https://gcc.godbolt.org/z/PPzYsPzMc
> >> ispowerof2(unsigned __int128):
> >> popcnt rdi, rdi
> >> popcnt rsi, rsi
> >> add esi, edi
> >> xor eax, eax
> >> cmp esi, 1
> >> sete al
> >> ret
> >>
> >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
> >
> > Because the popcount is going to the same register, there is no false
> > dependency ....
> > The false dependency errata only applies if the result of the popcnt
> > is going to a different register, the processor thinks it depends on
> > the result in that register from a previous instruction but it does
> > not (which is why it is called a false dependency). In this case it
> > actually does depend on the previous result since the input is the
> > same as the input.
>
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
> return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3 gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
> xor eax, eax
> xor edx, edx
> popcnt eax, [esp+4]
> popcnt edx, [esp+8]
> add eax, edx # eax is less than 64!
> cmp eax, 1 -> dec eax # 2 bytes shorter
dec eax is done for -Os already. -O2 means performance, it does not
mean decrease size. dec can be slower as it can create a false
dependency and it requires eax register to be not alive at the end of
the statement. and IIRC for x86 decode, it could cause 2 (not 1)
micro-ops.
> sete al
> movzx eax, al # superfluous
No it is not superfluous, well ok it is because of the context of eax
(besides the lower 8 bits) are already zero'd but keeping that track
is a hard problem and is turning problem really. And I suspect it
would cause another false dependency later on too.
For -Os -march=skylake (and -Oz instead of -Os) we get:
popcnt rdi, rdi
popcnt rsi, rsi
add esi, edi
xor eax, eax
dec esi
sete al
Which is exactly what you want right?
Thanks,
Andrew
> ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
> where its optimiser SHOULD be able see that the value is already
> less than 256!
>
> Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 22:52 ` Stefan Kanthak
2023-05-27 23:18 ` Mark Wielaard
2023-05-27 23:30 ` Andrew Pinski
@ 2023-05-28 6:40 ` Nicholas Vinson
2 siblings, 0 replies; 9+ messages in thread
From: Nicholas Vinson @ 2023-05-28 6:40 UTC (permalink / raw)
To: gcc
On 5/27/23 18:52, Stefan Kanthak wrote:
> "Andrew Pinski" <pinskia@gmail.com> wrote:
>
>> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>>
>>> --- dontcare.c ---
>>> int ispowerof2(unsigned __int128 argument) {
>>> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>>> }
>>> --- EOF ---
>>>
>>> GCC 13.3 gcc -march=haswell -O3
>>>
>>> https://gcc.godbolt.org/z/PPzYsPzMc
>>> ispowerof2(unsigned __int128):
>>> popcnt rdi, rdi
>>> popcnt rsi, rsi
>>> add esi, edi
>>> xor eax, eax
>>> cmp esi, 1
>>> sete al
>>> ret
>>>
>>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>> Because the popcount is going to the same register, there is no false
>> dependency ....
>> The false dependency errata only applies if the result of the popcnt
>> is going to a different register, the processor thinks it depends on
>> the result in that register from a previous instruction but it does
>> not (which is why it is called a false dependency). In this case it
>> actually does depend on the previous result since the input is the
>> same as the input.
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument) {
> return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3 gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
> xor eax, eax
> xor edx, edx
> popcnt eax, [esp+4]
> popcnt edx, [esp+8]
> add eax, edx # eax is less than 64!
Less than or equal to 64 (consider the case when input is (unsigned long
long)-1)
> cmp eax, 1 -> dec eax # 2 bytes shorter
> sete al
> movzx eax, al # superfluous
Not when dec is used. Use dec and omit this instruction, you may get a
result value of 0xffffff00 (consider the case when input is (unsigned
long long)0).
> ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
> kilo- or even megabytes, and they can extend code to cross a cache line
> or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions
> where its optimiser SHOULD be able see that the value is already
> less than 256!
>
> Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 23:30 ` Andrew Pinski
@ 2023-05-28 7:47 ` Stefan Kanthak
2023-05-28 12:29 ` David Brown
2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
2 siblings, 0 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-28 7:47 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc
"Andrew Pinski" <pinskia@gmail.com> wrote:
> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
[...]
>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>> return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3 gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>> xor eax, eax
>> xor edx, edx
>> popcnt eax, [esp+4]
>> popcnt edx, [esp+8]
>> add eax, edx # eax is less than 64!
>> cmp eax, 1 -> dec eax # 2 bytes shorter
>
> dec eax is done for -Os already. -O2 means performance, it does not
> mean decrease size. dec can be slower as it can create a false
> dependency and it requires eax register to be not alive at the end of
> the statement. and IIRC for x86 decode, it could cause 2 (not 1)
> micro-ops.
It CAN, it COULD, but is does NOT NEED to: it all depends on the target
processor. Shall I add an example with -march=<not affected processor>?
>> sete al
Depending on the target processor the partial register can also harm
the performance.
Did you forget to mention that too?
>> movzx eax, al # superfluous
>
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd
Correct.
The same holds for example for PMOVMSKB when the high(er) lane(s) of
the source [XYZ]MM register are (known to be) 0, for example after MOVQ;
that's what GCC also fails to track.
> but keeping that track is a hard problem and is turning problem really.
Aren't such problems just there to be solved?
> And I suspect it would cause another false dependency later on too.
All these quirks can be avoided with the following 6-byte code sequence
(same size as SETcc plus MOVZX) I used in one of my previous posts to
fold any non-zero value to 1:
neg eax
sbb eax, eax
neg eax
No partial register writes, no false dependencies, no INC/DEC subleties.
JFTR: AMD documents that SBB with same destination and source is handled
in the register renamer; I suspect Intel processors do it too,
albeit not documented.
> For -Os -march=skylake (and -Oz instead of -Os) we get:
> popcnt rdi, rdi
> popcnt rsi, rsi
> add esi, edi
> xor eax, eax
> dec esi
> sete al
>
> Which is exactly what you want right?
Yes.
For -m32 -Os/-Oz, AND if CDQ breaks the dependency, it should be
xor eax, eax
xor edx, edx -> cdq # 1 byte shorter
popcnt eax, [esp+4]
popcnt edx, [esp+8]
add eax, edx # eax is less than 64!
cmp eax, 1 -> dec eax # 2 bytes shorter
On AMD64 DEC <r32> is a 2-byte instruction; the following alternative code
avoids its potential false dependency as well as other possible quirks,
and also suits -Ot, -O2 and -O3 on processors where the register renamer
handles the XOR:
popcnt rdi, rdi
popcnt rsi, rsi
xor eax, eax
not edi # edi = -(edi + 1)
sub edi, esi # edi = -(edi + 1 + esi)
setz al
For processors where the register renamer doesn't "execute" XOR, but MOV,
the following code is an alternative for -Ot, -O2 and -O3:
popcnt rdi, rdi
popcnt rsi, rsi
mov eax, edi
add eax, esi
cmp eax, 1
setz al
Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)?
2023-05-27 23:30 ` Andrew Pinski
2023-05-28 7:47 ` Stefan Kanthak
@ 2023-05-28 12:29 ` David Brown
2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
2 siblings, 0 replies; 9+ messages in thread
From: David Brown @ 2023-05-28 12:29 UTC (permalink / raw)
To: gcc
On 28/05/2023 01:30, Andrew Pinski via Gcc wrote:
> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>>
>> sete al
>> movzx eax, al # superfluous
>
> No it is not superfluous, well ok it is because of the context of eax
> (besides the lower 8 bits) are already zero'd but keeping that track
> is a hard problem and is turning problem really. And I suspect it
> would cause another false dependency later on too.
>
> For -Os -march=skylake (and -Oz instead of -Os) we get:
> popcnt rdi, rdi
> popcnt rsi, rsi
> add esi, edi
> xor eax, eax
> dec esi
> sete al
>
> Which is exactly what you want right?
>
> Thanks,
> Andrew
>
There is also the option of using "bool" as the return type for boolean
functions, rather than "int". When returning a "bool", gcc does not add
the "movzx eax, al" instruction. (There are some circumstances where
returning "int" for a boolean value is a better choice, but usually
"bool" makes more sense, and it can often be a touch more efficient.)
David
^ permalink raw reply [flat|nested] 9+ messages in thread
* Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?)
2023-05-27 23:30 ` Andrew Pinski
2023-05-28 7:47 ` Stefan Kanthak
2023-05-28 12:29 ` David Brown
@ 2023-05-29 18:11 ` Stefan Kanthak
2 siblings, 0 replies; 9+ messages in thread
From: Stefan Kanthak @ 2023-05-29 18:11 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc
"Andrew Pinski" <pinskia@gmail.com> wrote:
> On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote:
>> Nevertheless GCC fails to optimise code properly:
>>
>> --- .c ---
>> int ispowerof2(unsigned long long argument) {
>> return __builtin_popcountll(argument) == 1;
>> }
>> --- EOF ---
>>
>> GCC 13.3 gcc -m32 -mpopcnt -O3
>>
>> https://godbolt.org/z/fT7a7jP4e
>> ispowerof2(unsigned long long):
>> xor eax, eax
>> xor edx, edx
>> popcnt eax, [esp+4]
>> popcnt edx, [esp+8]
>> add eax, edx # eax is less than 64!
>> cmp eax, 1 -> dec eax # 2 bytes shorter
sete al
movzx eax, al
ret
>
> dec eax is done for -Os already.
> -O2 means performance, it does not mean decrease size.
But -Os is supposed to optimise for size? REALITY CHECK:
GCC 13.3 gcc -m32 -mpopcnt -Os
https://godbolt.org/z/41Ed6rr6r
ispowerof2(unsigned long long):
push ebp
mov ebp, esp
sub esp, 16
push [ebp+12]
push [ebp+8]
call __popcountdi2
add esp, 16 # superfluous!
leave
dec eax
sete al
movzx eax, al
ret
While -O3 generates 9 instructions in a total of 24 bytes,
-Os[lowmotion] but generates 12 instructions in 29 bytes,
PLUS the uncounted instructions/bytes of __popcountdi2()!
This is what I call an EPIC FAILURE!
Stefan
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-05-29 18:19 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak
2023-05-27 21:47 ` Andrew Pinski
2023-05-27 22:52 ` Stefan Kanthak
2023-05-27 23:18 ` Mark Wielaard
2023-05-27 23:30 ` Andrew Pinski
2023-05-28 7:47 ` Stefan Kanthak
2023-05-28 12:29 ` David Brown
2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak
2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).