public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached
@ 2022-10-06 5:33 unlvsur at live dot com
2022-10-06 5:55 ` [Bug rtl-optimization/107167] " pinskia at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-06 5:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
Bug ID: 107167
Summary: It looks like GCC wastes registers on trivial
computations when result can be cached
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: unlvsur at live dot com
Target Milestone: ---
I do not know whether this is a big issue on targets that provide plenty of
registers (like aarch64 or loongarch64). However, it looks like a big issue
for x86_64, which provides only 16 general-purpose registers (and %rsp is
reserved, so 15 are available).
Take this example:
https://godbolt.org/z/77rEsr1PG
#include <bit>
unsigned Sigma1(unsigned x) noexcept
{
    return std::rotr(x, 6) ^ std::rotr(x, 11) ^ std::rotr(x, 25);
}
GCC generates code like this to avoid dependencies.
Sigma1m(unsigned int):
movl %edi, %eax
movl %edi, %edx
roll $7, %edi
rorl $6, %eax
rorl $11, %edx
xorl %edx, %eax
xorl %edi, %eax
ret
However:
mySigma1m(unsigned int):
movl %edi, %eax
rorl $6, %edi
rorl $11, %eax
xorl %edi, %eax
rorl $19, %edi
xorl %edi, %eax
ret
This saves one register here. That becomes a serious problem when a lot of
computation is involved and registers are in short supply.
The first version also needs one more instruction, which can affect the
instruction cache.
Aggressively using all registers may not give the best results: a local
maximum is not necessarily a global maximum.
I don't know.
From: pinskia at gcc dot gnu.org @ 2022-10-06 5:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
Keywords| |missed-optimization
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a reassociation, scheduling, and register allocation issue.
Plus, your example might be slower due to dependencies.
Without a full example of where GCC's RA goes wrong, GCC actually produces much
better code for this example due to register renaming in hardware.
Note that many x86_64 CPUs also do register renaming for the stack.
From: unlvsur at live dot com @ 2022-10-06 6:00 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
--- Comment #2 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
>
> Plus, your example might be slower due to dependencies.
>
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.
The problem is that I do things like sha512_round:
sha512_round(x[0]=big_endian(W[0]),a,b,d,e,f,g,h,bpc,0x428a2f98d728ae22);
sha512_round(x[1]=big_endian(W[1]),h,a,c,d,e,f,g,bpc,0x7137449123ef65cd);
sha512_round(x[2]=big_endian(W[2]),g,h,b,c,d,e,f,bpc,0xb5c0fbcfec4d3b2f);
sha512_round(x[3]=big_endian(W[3]),f,g,a,b,c,d,e,bpc,0xe9b5dba58189dbbc);
sha512_round(x[4]=big_endian(W[4]),e,f,h,a,b,c,d,bpc,0x3956c25bf348b538);
These use a lot of registers. If GCC wastes registers, a lot of time is wasted
on stack stores and loads.
My implementation compiled with GCC on x86_64 is slower than OpenSSL's asm
version, particularly for this reason. GCC just spills too many values to the
stack.
https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha512-x86_64.pl#L192
OpenSSL does exactly what I do here.
From: unlvsur at live dot com @ 2022-10-06 6:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
>
> Plus, your example might be slower due to dependencies.
>
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.
https://github.com/openssl/openssl/blob/a8572674f12ceb39f7e66ccbaa8918b922c76739/crypto/sha/asm/sha512-x86_64.pl#L16
They mentioned this before: a 40% improvement over compiler-generated code.
"I really wonder why gcc [being armed with inline assembler] fails to
generate as fast code."
# sha256/512_block procedure for x86_64.
#
# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)
From: unlvsur at live dot com @ 2022-10-06 6:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
--- Comment #4 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> Plus your example might be slower due to dependencies.
Dependencies are only an issue to a certain degree. The first version also has
sequences like "movl %edi, %edx; rorl $11, %edx", which is a flow dependency
too.
CPUs resolve flow dependencies to a very large degree with register forwarding.
Write dependencies are also dealt with by register renaming, and if we save
registers, register renaming will also save time.
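The dependency argument can be made concrete with a toy model. The sketch below is an assumption-heavy illustration, not a measurement: every instruction is given a latency of one cycle, only true (read-after-write) dependencies are counted because renaming is assumed to remove the others, and execution ports, move elimination, and decode limits are ignored. All names (Insn, critical_path, gcc_sequence, hand_sequence) are hypothetical. It encodes the two listings from the original report and computes the longest dependency chain of each.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

enum Reg { EAX, EDX, EDI, RegCount };

// One instruction in the toy model: registers read and the register written.
struct Insn {
    std::vector<int> reads;
    int writes;
};

// Longest chain of read-after-write dependencies, one cycle per instruction.
inline int critical_path(const std::vector<Insn>& seq)
{
    std::array<int, RegCount> ready{};  // cycle at which each value is available
    int longest = 0;
    for (const Insn& insn : seq) {
        int start = 0;
        for (int r : insn.reads)
            start = std::max(start, ready[static_cast<std::size_t>(r)]);
        ready[static_cast<std::size_t>(insn.writes)] = start + 1;
        longest = std::max(longest, start + 1);
    }
    return longest;
}

// GCC's output from the report (ret omitted).
inline std::vector<Insn> gcc_sequence()
{
    return {
        {{EDI}, EAX},       // movl %edi, %eax
        {{EDI}, EDX},       // movl %edi, %edx
        {{EDI}, EDI},       // roll $7, %edi
        {{EAX}, EAX},       // rorl $6, %eax
        {{EDX}, EDX},       // rorl $11, %edx
        {{EDX, EAX}, EAX},  // xorl %edx, %eax
        {{EDI, EAX}, EAX},  // xorl %edi, %eax
    };
}

// The hand-written sequence from the report (ret omitted).
inline std::vector<Insn> hand_sequence()
{
    return {
        {{EDI}, EAX},       // movl %edi, %eax
        {{EDI}, EDI},       // rorl $6, %edi
        {{EAX}, EAX},       // rorl $11, %eax
        {{EDI, EAX}, EAX},  // xorl %edi, %eax
        {{EDI}, EDI},       // rorl $19, %edi
        {{EDI, EAX}, EAX},  // xorl %edi, %eax
    };
}
```

Calling critical_path on each sequence gives the chain length under these assumptions only. Real latencies and port limits differ between microarchitectures, so the model cannot settle which sequence is faster on actual hardware.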
From: pinskia at gcc dot gnu.org @ 2022-10-06 6:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |DUPLICATE
Status|UNCONFIRMED |RESOLVED
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
You already filed this one.
*** This bug has been marked as a duplicate of bug 103550 ***
From: unlvsur at live dot com @ 2022-10-07 1:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167
--- Comment #6 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> You already filed this one.
>
> *** This bug has been marked as a duplicate of bug 103550 ***
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
>
> Plus, your example might be slower due to dependencies.
>
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.
I just checked uops.info: on x86_64, only two ports are available for the
rotate instructions (ror, rol), so they cannot really be executed in parallel.