* Who cares about performance (or Intel's CPU errata)? @ 2023-05-27 21:20 Stefan Kanthak 2023-05-27 21:47 ` Andrew Pinski 0 siblings, 1 reply; 9+ messages in thread From: Stefan Kanthak @ 2023-05-27 21:20 UTC (permalink / raw) To: gcc Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: --- dontcare.c --- int ispowerof2(unsigned __int128 argument) { return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1; } --- EOF --- GCC 13.3 gcc -march=haswell -O3 https://gcc.godbolt.org/z/PPzYsPzMc ispowerof2(unsigned __int128): popcnt rdi, rdi popcnt rsi, rsi add esi, edi xor eax, eax cmp esi, 1 sete al ret OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output? See https://gcc.godbolt.org/z/jdjTc3EET for comparison! FIX YOUR BUGS, KIDS! Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak @ 2023-05-27 21:47 ` Andrew Pinski 2023-05-27 22:52 ` Stefan Kanthak 0 siblings, 1 reply; 9+ messages in thread From: Andrew Pinski @ 2023-05-27 21:47 UTC (permalink / raw) To: Stefan Kanthak; +Cc: gcc On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: > > Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: > > --- dontcare.c --- > int ispowerof2(unsigned __int128 argument) { > return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1; > } > --- EOF --- > > GCC 13.3 gcc -march=haswell -O3 > > https://gcc.godbolt.org/z/PPzYsPzMc > ispowerof2(unsigned __int128): > popcnt rdi, rdi > popcnt rsi, rsi > add esi, edi > xor eax, eax > cmp esi, 1 > sete al > ret > > OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output? Because the popcount is going to the same register, there is no false dependency .... The false dependency errata only applies if the result of the popcnt is going to a different register, the processor thinks it depends on the result in that register from a previous instruction but it does not (which is why it is called a false dependency). In this case it actually does depend on the previous result since the input is the same as the input. Thanks, Andrew > > See https://gcc.godbolt.org/z/jdjTc3EET for comparison! > > FIX YOUR BUGS, KIDS! > > Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 21:47 ` Andrew Pinski @ 2023-05-27 22:52 ` Stefan Kanthak 2023-05-27 23:18 ` Mark Wielaard ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Stefan Kanthak @ 2023-05-27 22:52 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc "Andrew Pinski" <pinskia@gmail.com> wrote: > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: >> >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: >> >> --- dontcare.c --- >> int ispowerof2(unsigned __int128 argument) { >> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1; >> } >> --- EOF --- >> >> GCC 13.3 gcc -march=haswell -O3 >> >> https://gcc.godbolt.org/z/PPzYsPzMc >> ispowerof2(unsigned __int128): >> popcnt rdi, rdi >> popcnt rsi, rsi >> add esi, edi >> xor eax, eax >> cmp esi, 1 >> sete al >> ret >> >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output? > > Because the popcount is going to the same register, there is no false > dependency .... > The false dependency errata only applies if the result of the popcnt > is going to a different register, the processor thinks it depends on > the result in that register from a previous instruction but it does > not (which is why it is called a false dependency). In this case it > actually does depend on the previous result since the input is the > same as the input. OUCH, my fault; sorry for the confusion and the wrong accusation. Nevertheless GCC fails to optimise code properly: --- .c --- int ispowerof2(unsigned long long argument) { return __builtin_popcountll(argument) == 1; } --- EOF --- GCC 13.3 gcc -m32 -mpopcnt -O3 https://godbolt.org/z/fT7a7jP4e ispowerof2(unsigned long long): xor eax, eax xor edx, edx popcnt eax, [esp+4] popcnt edx, [esp+8] add eax, edx # eax is less than 64! cmp eax, 1 -> dec eax # 2 bytes shorter sete al movzx eax, al # superfluous ret 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to kilo- or even megabytes, and they can extend code to cross a cache line or a 16-byte alignment boundary. JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument JFTR: GCC is notorious for generating superfluous MOVZX instructions where its optimiser SHOULD be able see that the value is already less than 256! Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 22:52 ` Stefan Kanthak @ 2023-05-27 23:18 ` Mark Wielaard 2023-05-27 23:30 ` Andrew Pinski 2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson 2 siblings, 0 replies; 9+ messages in thread From: Mark Wielaard @ 2023-05-27 23:18 UTC (permalink / raw) To: Stefan Kanthak; +Cc: Andrew Pinski, gcc Hi Stefan, On Sun, May 28, 2023 at 12:52:30AM +0200, Stefan Kanthak wrote: > >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: This is totally not constructive. You should stop attacking others like that on this list. > OUCH, my fault; sorry for the confusion and the wrong accusation. Thanks for apologizing. There is a system for reporting bugs. As others have requested, please use that: https://gcc.gnu.org/bugs/#where Please stop ranting on this list and report bugs in bugzilla. Or you might get banned from the list. Thanks, Mark ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 22:52 ` Stefan Kanthak 2023-05-27 23:18 ` Mark Wielaard @ 2023-05-27 23:30 ` Andrew Pinski 2023-05-28 7:47 ` Stefan Kanthak ` (2 more replies) 2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson 2 siblings, 3 replies; 9+ messages in thread From: Andrew Pinski @ 2023-05-27 23:30 UTC (permalink / raw) To: Stefan Kanthak; +Cc: gcc On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: > > "Andrew Pinski" <pinskia@gmail.com> wrote: > > > On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: > >> > >> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: > >> > >> --- dontcare.c --- > >> int ispowerof2(unsigned __int128 argument) { > >> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1; > >> } > >> --- EOF --- > >> > >> GCC 13.3 gcc -march=haswell -O3 > >> > >> https://gcc.godbolt.org/z/PPzYsPzMc > >> ispowerof2(unsigned __int128): > >> popcnt rdi, rdi > >> popcnt rsi, rsi > >> add esi, edi > >> xor eax, eax > >> cmp esi, 1 > >> sete al > >> ret > >> > >> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output? > > > > Because the popcount is going to the same register, there is no false > > dependency .... > > The false dependency errata only applies if the result of the popcnt > > is going to a different register, the processor thinks it depends on > > the result in that register from a previous instruction but it does > > not (which is why it is called a false dependency). In this case it > > actually does depend on the previous result since the input is the > > same as the input. > > OUCH, my fault; sorry for the confusion and the wrong accusation. > > Nevertheless GCC fails to optimise code properly: > > --- .c --- > int ispowerof2(unsigned long long argument) { > return __builtin_popcountll(argument) == 1; > } > --- EOF --- > > GCC 13.3 gcc -m32 -mpopcnt -O3 > > https://godbolt.org/z/fT7a7jP4e > ispowerof2(unsigned long long): > xor eax, eax > xor edx, edx > popcnt eax, [esp+4] > popcnt edx, [esp+8] > add eax, edx # eax is less than 64! > cmp eax, 1 -> dec eax # 2 bytes shorter dec eax is done for -Os already. -O2 means performance, it does not mean decrease size. dec can be slower as it can create a false dependency and it requires eax register to be not alive at the end of the statement. and IIRC for x86 decode, it could cause 2 (not 1) micro-ops. > sete al > movzx eax, al # superfluous No it is not superfluous, well ok it is because of the context of eax (besides the lower 8 bits) are already zero'd but keeping that track is a hard problem and is turning problem really. And I suspect it would cause another false dependency later on too. For -Os -march=skylake (and -Oz instead of -Os) we get: popcnt rdi, rdi popcnt rsi, rsi add esi, edi xor eax, eax dec esi sete al Which is exactly what you want right? Thanks, Andrew > ret > > 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to > kilo- or even megabytes, and they can extend code to cross a cache line > or a 16-byte alignment boundary. > > JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument > > JFTR: GCC is notorious for generating superfluous MOVZX instructions > where its optimiser SHOULD be able see that the value is already > less than 256! > > Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 23:30 ` Andrew Pinski @ 2023-05-28 7:47 ` Stefan Kanthak 2023-05-28 12:29 ` David Brown 2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak 2 siblings, 0 replies; 9+ messages in thread From: Stefan Kanthak @ 2023-05-28 7:47 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc "Andrew Pinski" <pinskia@gmail.com> wrote: > On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: [...] >> Nevertheless GCC fails to optimise code properly: >> >> --- .c --- >> int ispowerof2(unsigned long long argument) { >> return __builtin_popcountll(argument) == 1; >> } >> --- EOF --- >> >> GCC 13.3 gcc -m32 -mpopcnt -O3 >> >> https://godbolt.org/z/fT7a7jP4e >> ispowerof2(unsigned long long): >> xor eax, eax >> xor edx, edx >> popcnt eax, [esp+4] >> popcnt edx, [esp+8] >> add eax, edx # eax is less than 64! >> cmp eax, 1 -> dec eax # 2 bytes shorter > > dec eax is done for -Os already. -O2 means performance, it does not > mean decrease size. dec can be slower as it can create a false > dependency and it requires eax register to be not alive at the end of > the statement. and IIRC for x86 decode, it could cause 2 (not 1) > micro-ops. It CAN, it COULD, but is does NOT NEED to: it all depends on the target processor. Shall I add an example with -march=<not affected processor>? >> sete al Depending on the target processor the partial register can also harm the performance. Did you forget to mention that too? >> movzx eax, al # superfluous > > No it is not superfluous, well ok it is because of the context of eax > (besides the lower 8 bits) are already zero'd Correct. The same holds for example for PMOVMSKB when the high(er) lane(s) of the source [XYZ]MM register are (known to be) 0, for example after MOVQ; that's what GCC also fails to track. > but keeping that track is a hard problem and is turning problem really. Aren't such problems just there to be solved? > And I suspect it would cause another false dependency later on too. All these quirks can be avoided with the following 6-byte code sequence (same size as SETcc plus MOVZX) I used in one of my previous posts to fold any non-zero value to 1: neg eax sbb eax, eax neg eax No partial register writes, no false dependencies, no INC/DEC subleties. JFTR: AMD documents that SBB with same destination and source is handled in the register renamer; I suspect Intel processors do it too, albeit not documented. > For -Os -march=skylake (and -Oz instead of -Os) we get: > popcnt rdi, rdi > popcnt rsi, rsi > add esi, edi > xor eax, eax > dec esi > sete al > > Which is exactly what you want right? Yes. For -m32 -Os/-Oz, AND if CDQ breaks the dependency, it should be xor eax, eax xor edx, edx -> cdq # 1 byte shorter popcnt eax, [esp+4] popcnt edx, [esp+8] add eax, edx # eax is less than 64! cmp eax, 1 -> dec eax # 2 bytes shorter On AMD64 DEC <r32> is a 2-byte instruction; the following alternative code avoids its potential false dependency as well as other possible quirks, and also suits -Ot, -O2 and -O3 on processors where the register renamer handles the XOR: popcnt rdi, rdi popcnt rsi, rsi xor eax, eax not edi # edi = -(edi + 1) sub edi, esi # edi = -(edi + 1 + esi) setz al For processors where the register renamer doesn't "execute" XOR, but MOV, the following code is an alternative for -Ot, -O2 and -O3: popcnt rdi, rdi popcnt rsi, rsi mov eax, edi add eax, esi cmp eax, 1 setz al Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 23:30 ` Andrew Pinski 2023-05-28 7:47 ` Stefan Kanthak @ 2023-05-28 12:29 ` David Brown 2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak 2 siblings, 0 replies; 9+ messages in thread From: David Brown @ 2023-05-28 12:29 UTC (permalink / raw) To: gcc On 28/05/2023 01:30, Andrew Pinski via Gcc wrote: > On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: >> >> sete al >> movzx eax, al # superfluous > > No it is not superfluous, well ok it is because of the context of eax > (besides the lower 8 bits) are already zero'd but keeping that track > is a hard problem and is turning problem really. And I suspect it > would cause another false dependency later on too. > > For -Os -march=skylake (and -Oz instead of -Os) we get: > popcnt rdi, rdi > popcnt rsi, rsi > add esi, edi > xor eax, eax > dec esi > sete al > > Which is exactly what you want right? > > Thanks, > Andrew > There is also the option of using "bool" as the return type for boolean functions, rather than "int". When returning a "bool", gcc does not add the "movzx eax, al" instruction. (There are some circumstances where returning "int" for a boolean value is a better choice, but usually "bool" makes more sense, and it can often be a touch more efficient.) David ^ permalink raw reply [flat|nested] 9+ messages in thread
* Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) 2023-05-27 23:30 ` Andrew Pinski 2023-05-28 7:47 ` Stefan Kanthak 2023-05-28 12:29 ` David Brown @ 2023-05-29 18:11 ` Stefan Kanthak 2 siblings, 0 replies; 9+ messages in thread From: Stefan Kanthak @ 2023-05-29 18:11 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc "Andrew Pinski" <pinskia@gmail.com> wrote: > On Sat, May 27, 2023 at 3:54 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: >> Nevertheless GCC fails to optimise code properly: >> >> --- .c --- >> int ispowerof2(unsigned long long argument) { >> return __builtin_popcountll(argument) == 1; >> } >> --- EOF --- >> >> GCC 13.3 gcc -m32 -mpopcnt -O3 >> >> https://godbolt.org/z/fT7a7jP4e >> ispowerof2(unsigned long long): >> xor eax, eax >> xor edx, edx >> popcnt eax, [esp+4] >> popcnt edx, [esp+8] >> add eax, edx # eax is less than 64! >> cmp eax, 1 -> dec eax # 2 bytes shorter sete al movzx eax, al ret > > dec eax is done for -Os already. > -O2 means performance, it does not mean decrease size. But -Os is supposed to optimise for size? REALITY CHECK: GCC 13.3 gcc -m32 -mpopcnt -Os https://godbolt.org/z/41Ed6rr6r ispowerof2(unsigned long long): push ebp mov ebp, esp sub esp, 16 push [ebp+12] push [ebp+8] call __popcountdi2 add esp, 16 # superfluous! leave dec eax sete al movzx eax, al ret While -O3 generates 9 instructions in a total of 24 bytes, -Os[lowmotion] but generates 12 instructions in 29 bytes, PLUS the uncounted instructions/bytes of __popcountdi2()! This is what I call an EPIC FAILURE! Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Who cares about performance (or Intel's CPU errata)? 2023-05-27 22:52 ` Stefan Kanthak 2023-05-27 23:18 ` Mark Wielaard 2023-05-27 23:30 ` Andrew Pinski @ 2023-05-28 6:40 ` Nicholas Vinson 2 siblings, 0 replies; 9+ messages in thread From: Nicholas Vinson @ 2023-05-28 6:40 UTC (permalink / raw) To: gcc On 5/27/23 18:52, Stefan Kanthak wrote: > "Andrew Pinski" <pinskia@gmail.com> wrote: > >> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak <stefan.kanthak@nexgo.de> wrote: >>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are: >>> >>> --- dontcare.c --- >>> int ispowerof2(unsigned __int128 argument) { >>> return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1; >>> } >>> --- EOF --- >>> >>> GCC 13.3 gcc -march=haswell -O3 >>> >>> https://gcc.godbolt.org/z/PPzYsPzMc >>> ispowerof2(unsigned __int128): >>> popcnt rdi, rdi >>> popcnt rsi, rsi >>> add esi, edi >>> xor eax, eax >>> cmp esi, 1 >>> sete al >>> ret >>> >>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output? >> Because the popcount is going to the same register, there is no false >> dependency .... >> The false dependency errata only applies if the result of the popcnt >> is going to a different register, the processor thinks it depends on >> the result in that register from a previous instruction but it does >> not (which is why it is called a false dependency). In this case it >> actually does depend on the previous result since the input is the >> same as the input. > OUCH, my fault; sorry for the confusion and the wrong accusation. > > Nevertheless GCC fails to optimise code properly: > > --- .c --- > int ispowerof2(unsigned long long argument) { > return __builtin_popcountll(argument) == 1; > } > --- EOF --- > > GCC 13.3 gcc -m32 -mpopcnt -O3 > > https://godbolt.org/z/fT7a7jP4e > ispowerof2(unsigned long long): > xor eax, eax > xor edx, edx > popcnt eax, [esp+4] > popcnt edx, [esp+8] > add eax, edx # eax is less than 64! Less than or equal to 64 (consider the case when input is (unsigned long long)-1) > cmp eax, 1 -> dec eax # 2 bytes shorter > sete al > movzx eax, al # superfluous Not when dec is used. Use dec and omit this instruction, you may get a result value of 0xffffff00 (consider the case when input is (unsigned long long)0). > ret > > 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to > kilo- or even megabytes, and they can extend code to cross a cache line > or a 16-byte alignment boundary. > > JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument > > JFTR: GCC is notorious for generating superfluous MOVZX instructions > where its optimiser SHOULD be able see that the value is already > less than 256! > > Stefan ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-05-29 18:19 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-05-27 21:20 Who cares about performance (or Intel's CPU errata)? Stefan Kanthak 2023-05-27 21:47 ` Andrew Pinski 2023-05-27 22:52 ` Stefan Kanthak 2023-05-27 23:18 ` Mark Wielaard 2023-05-27 23:30 ` Andrew Pinski 2023-05-28 7:47 ` Stefan Kanthak 2023-05-28 12:29 ` David Brown 2023-05-29 18:11 ` Who cares about size? (was: Who cares about performance (or Intel's CPU errata)?) Stefan Kanthak 2023-05-28 6:40 ` Who cares about performance (or Intel's CPU errata)? Nicholas Vinson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).