public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached
From: unlvsur at live dot com @ 2022-10-06 5:33 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

            Bug ID: 107167
           Summary: It looks like GCC wastes registers on trivial
                    computations when result can be cached
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: unlvsur at live dot com
  Target Milestone: ---

I do not know whether this is a big issue on targets that provide plenty of
registers (like aarch64 or loongarch64). However, it looks like a big issue
for x86_64, which only provides 16 general purpose registers (plus %rsp is
reserved, so 15 available registers).

Take an example like this:
https://godbolt.org/z/77rEsr1PG

    #include<bit>

    unsigned Sigma1(unsigned x) noexcept
    {
        return std::rotr(x,6)^std::rotr(x,11)^std::rotr(x,25);
    }

GCC generates code like this to avoid dependencies.

    Sigma1m(unsigned int):
            movl    %edi, %eax
            movl    %edi, %edx
            roll    $7, %edi
            rorl    $6, %eax
            rorl    $11, %edx
            xorl    %edx, %eax
            xorl    %edi, %eax
            ret

However:

    mySigma1m(unsigned int):
            movl    %edi, %eax
            rorl    $6, %edi
            rorl    $11, %eax
            xorl    %edi, %eax
            rorl    $19, %edi
            xorl    %edi, %eax
            ret

saves one register for this task (%edx is never used). That becomes a huge
problem when a lot of computation is involved and registers are in short
supply.

The first version also has one more instruction (seven before the ret,
against six), which can affect the instruction cache.

Aggressively utilizing all registers may not give the best results. Local
maximum =/= Global maximum. I don't know.
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: pinskia at gcc dot gnu.org @ 2022-10-06 5:55 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a reassociation, scheduling issue and register allocation issue.

Plus your example might be slower due to dependencies.

Without a full example of where gcc ra goes wrong, gcc actually produces much
better code for this example due to register renaming in hw.
Note many x86_64 also does register renaming for the stack too
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: unlvsur at live dot com @ 2022-10-06 6:00 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #2 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling issue and register allocation issue.
>
> Plus your example might be slower due to dependencies.
>
> Without a full example of where gcc ra goes wrong, gcc actually produces
> much better code for this example due to register renaming in hw.
> Note many x86_64 also does register renaming for the stack too

The problem is that I do things like sha512_round:

    sha512_round(x[0]=big_endian(W[0]),a,b,d,e,f,g,h,bpc,0x428a2f98d728ae22);
    sha512_round(x[1]=big_endian(W[1]),h,a,c,d,e,f,g,bpc,0x7137449123ef65cd);
    sha512_round(x[2]=big_endian(W[2]),g,h,b,c,d,e,f,bpc,0xb5c0fbcfec4d3b2f);
    sha512_round(x[3]=big_endian(W[3]),f,g,a,b,c,d,e,bpc,0xe9b5dba58189dbbc);
    sha512_round(x[4]=big_endian(W[4]),e,f,h,a,b,c,d,bpc,0x3956c25bf348b538);

They use tons of registers. If GCC wastes registers, a lot of time is wasted
on stack pushes/loads.

My implementation compiled by GCC on x86_64 is slower than OpenSSL's asm
version, particularly for this reason. GCC just pushes/stores too many values
on the stack.

https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha512-x86_64.pl#L192

OpenSSL does exactly what I do here.
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: unlvsur at live dot com @ 2022-10-06 6:03 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling issue and register allocation issue.
>
> Plus your example might be slower due to dependencies.
>
> Without a full example of where gcc ra goes wrong, gcc actually produces
> much better code for this example due to register renaming in hw.
> Note many x86_64 also does register renaming for the stack too

https://github.com/openssl/openssl/blob/a8572674f12ceb39f7e66ccbaa8918b922c76739/crypto/sha/asm/sha512-x86_64.pl#L16

They mentioned that before. 40% improvement over compiler-generated code.

"I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code."

    # sha256/512_block procedure for x86_64.
    #
    # 40% improvement over compiler-generated code on Opteron. On EM64T
    # sha256 was observed to run >80% faster and sha512 - >40%. No magical
    # tricks, just straight implementation... I really wonder why gcc
    # [being armed with inline assembler] fails to generate as fast code.
    # The only thing which is cool about this module is that it's very
    # same instruction sequence used for both SHA-256 and SHA-512. In
    # former case the instructions operate on 32-bit operands, while in
    # latter - on 64-bit ones. All I had to do is to get one flavor right,
    # the other one passed the test right away:-)
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: unlvsur at live dot com @ 2022-10-06 6:05 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #4 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> Plus your example might be slower due to dependencies.

Dependencies are only an issue to a certain degree. The first version also has
sequences like "movl %edi, %edx; rorl $11, %edx", which are flow dependencies
too.

CPUs resolve flow dependencies to a very large degree with register
forwarding. Write dependencies are also dealt with by register renaming. If we
save registers, register renaming will also save time.
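The dependency argument can be explored with a toy model. The sketch below is
an assumption-laden illustration, not a measurement and not something posted
in the thread: every instruction is given a latency of one cycle, execution
ports are unlimited, and only true (read-after-write) dependencies count,
on the assumption that register renaming removes the rest. The names Program,
critical_path, gcc_sequence and chained_sequence are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each entry lists the indices of earlier instructions whose results
// the instruction at that position reads.
using Program = std::vector<std::vector<std::size_t>>;

// Longest chain of true dependencies, with one cycle per instruction.
inline int critical_path(const Program& prog) {
    std::vector<int> finish(prog.size(), 0);
    int longest = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        int start = 0;
        for (std::size_t d : prog[i]) start = std::max(start, finish[d]);
        finish[i] = start + 1;
        longest = std::max(longest, finish[i]);
    }
    return longest;
}

// The sequence GCC emitted in the report.
inline Program gcc_sequence() {
    return {
        {},      // 0: movl %edi, %eax
        {},      // 1: movl %edi, %edx
        {},      // 2: roll $7, %edi
        {0},     // 3: rorl $6, %eax
        {1},     // 4: rorl $11, %edx
        {3, 4},  // 5: xorl %edx, %eax
        {5, 2},  // 6: xorl %edi, %eax
    };
}

// The hand-written sequence from the report.
inline Program chained_sequence() {
    return {
        {},      // 0: movl %edi, %eax
        {},      // 1: rorl $6, %edi
        {0},     // 2: rorl $11, %eax
        {2, 1},  // 3: xorl %edi, %eax
        {1},     // 4: rorl $19, %edi
        {3, 4},  // 5: xorl %edi, %eax
    };
}
```

Under these assumptions critical_path returns 4 for both sequences. Real
hardware differs from the model in ways it ignores, such as move elimination,
the number of ports able to execute rotates, and actual instruction
latencies, so the result says nothing definitive about which sequence is
faster.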
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: pinskia at gcc dot gnu.org @ 2022-10-06 6:47 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
You already filed this one.

*** This bug has been marked as a duplicate of bug 103550 ***
* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
From: unlvsur at live dot com @ 2022-10-07 1:41 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #6 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> You already filed this one.
>
> *** This bug has been marked as a duplicate of bug 103550 ***

(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling issue and register allocation issue.
>
> Plus your example might be slower due to dependencies.
>
> Without a full example of where gcc ra goes wrong, gcc actually produces
> much better code for this example due to register renaming in hw.
> Note many x86_64 also does register renaming for the stack too

I just checked uops.info: on x86_64, only two ports are available for
rotr/rotl, so the three rotations cannot really all be executed in parallel.