[Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached

* [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached
@ 2022-10-06  5:33 unlvsur at live dot com
  2022-10-06  5:55 ` [Bug rtl-optimization/107167] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-06  5:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

            Bug ID: 107167
           Summary: It looks like GCC wastes registers on trivial
                    computations when result can be cached
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: unlvsur at live dot com
  Target Milestone: ---

I do not know whether this is a big issue on targets that provide plenty of
registers (like aarch64 or loongarch64). However, it looks like a big issue for
x86_64, which provides only 16 general-purpose registers (and %rsp is reserved,
so 15 are available).
Take this example:

https://godbolt.org/z/77rEsr1PG

#include<bit>

unsigned Sigma1(unsigned x) noexcept
{
    return std::rotr(x,6)^std::rotr(x,11)^std::rotr(x,25);
}


GCC generates code like this to avoid dependencies.
Sigma1m(unsigned int):
        movl    %edi, %eax
        movl    %edi, %edx
        roll    $7, %edi
        rorl    $6, %eax
        rorl    $11, %edx
        xorl    %edx, %eax
        xorl    %edi, %eax
        ret

However:
mySigma1m(unsigned int):
        movl    %edi, %eax
        rorl    $6, %edi
        rorl    $11, %eax
        xorl    %edi, %eax
        rorl    $19, %edi
        xorl    %edi, %eax
        ret

This version saves one register. That becomes a huge problem when a lot of
computation is involved and registers are in short supply.

The first version is also one instruction longer, which can affect the
instruction cache.

Aggressively utilizing all registers may not give the best results. Local
maximum =/= Global maximum.
I don't know.

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
@ 2022-10-06  5:55 ` pinskia at gcc dot gnu.org
  2022-10-06  6:00 ` unlvsur at live dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-10-06  5:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a reassociation, scheduling, and register allocation issue.

Plus your example might be slower due to dependencies.

Without a full example of where GCC's RA goes wrong, GCC actually produces much
better code for this example due to register renaming in hardware.
Note that many x86_64 CPUs also do register renaming for the stack.

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
  2022-10-06  5:55 ` [Bug rtl-optimization/107167] " pinskia at gcc dot gnu.org
@ 2022-10-06  6:00 ` unlvsur at live dot com
  2022-10-06  6:03 ` unlvsur at live dot com
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-06  6:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #2 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
> 
> Plus your example might be slower due to dependencies.
> 
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.

The problem is that I do things like sha512_round:

sha512_round(x[0]=big_endian(W[0]),a,b,d,e,f,g,h,bpc,0x428a2f98d728ae22);
sha512_round(x[1]=big_endian(W[1]),h,a,c,d,e,f,g,bpc,0x7137449123ef65cd);
sha512_round(x[2]=big_endian(W[2]),g,h,b,c,d,e,f,bpc,0xb5c0fbcfec4d3b2f);
sha512_round(x[3]=big_endian(W[3]),f,g,a,b,c,d,e,bpc,0xe9b5dba58189dbbc);
sha512_round(x[4]=big_endian(W[4]),e,f,h,a,b,c,d,bpc,0x3956c25bf348b538);

These use tons of registers. If GCC wastes registers, a lot of time is wasted
on stack stores and loads.

My implementation compiled with GCC on x86_64 is slower than OpenSSL's asm
version, particularly for this reason. GCC just stores too many values on the
stack.

https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha512-x86_64.pl#L192

OpenSSL does exactly what I do here.

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
  2022-10-06  5:55 ` [Bug rtl-optimization/107167] " pinskia at gcc dot gnu.org
  2022-10-06  6:00 ` unlvsur at live dot com
@ 2022-10-06  6:03 ` unlvsur at live dot com
  2022-10-06  6:05 ` unlvsur at live dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-06  6:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
> 
> Plus your example might be slower due to dependencies.
> 
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.

https://github.com/openssl/openssl/blob/a8572674f12ceb39f7e66ccbaa8918b922c76739/crypto/sha/asm/sha512-x86_64.pl#L16

They mentioned that before: a 40% improvement over compiler-generated code.
"I really wonder why gcc [being armed with inline assembler] fails to
generate as fast code."

# sha256/512_block procedure for x86_64.
#
# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
                   ` (2 preceding siblings ...)
  2022-10-06  6:03 ` unlvsur at live dot com
@ 2022-10-06  6:05 ` unlvsur at live dot com
  2022-10-06  6:47 ` pinskia at gcc dot gnu.org
  2022-10-07  1:41 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-06  6:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #4 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #1)
> Plus your example might be slower due to dependencies.


Dependencies are only an issue to a certain degree. The first version also has
sequences like "movl    %edi, %edx;  rorl    $11, %edx", which is a flow
dependency too.

CPUs resolve flow dependencies to a very large degree with register forwarding.

Write dependencies are also dealt with by register renaming. If we save
registers, register renaming will also save time.
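
The dependency argument can be checked with a toy model. The sketch below computes the longest chain of true (read-after-write) dependencies in each of the two sequences from the report, assuming every instruction has a latency of one cycle, execution ports are unlimited, and renaming removes all false dependencies. These assumptions are simplifications, not a description of any real CPU, and the names Insn, critical_path, gcc_seq and hand_seq are made up for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Toy model (assumption): unit latency, unlimited ports, perfect renaming.
enum Reg { EAX, EDX, EDI, NumRegs };

struct Insn {
    Reg dst;         // register written
    Reg src;         // register read (the copy source for mov)
    bool reads_dst;  // true for rotate/xor, false for mov
};

// Length of the longest true-dependency chain in a straight-line sequence.
template <std::size_t N>
constexpr int critical_path(const std::array<Insn, N>& seq) {
    std::array<int, NumRegs> ready{};  // cycle at which each value is available
    int longest = 0;
    for (const Insn& insn : seq) {
        int start = ready[insn.src];
        if (insn.reads_dst) start = std::max(start, ready[insn.dst]);
        ready[insn.dst] = start + 1;
        longest = std::max(longest, ready[insn.dst]);
    }
    return longest;
}

// GCC's sequence from the report (ret omitted).
inline constexpr std::array<Insn, 7> gcc_seq{{
    {EAX, EDI, false},  // movl %edi, %eax
    {EDX, EDI, false},  // movl %edi, %edx
    {EDI, EDI, true},   // roll $7, %edi
    {EAX, EAX, true},   // rorl $6, %eax
    {EDX, EDX, true},   // rorl $11, %edx
    {EAX, EDX, true},   // xorl %edx, %eax
    {EAX, EDI, true},   // xorl %edi, %eax
}};

// The hand-written sequence from the report (ret omitted).
inline constexpr std::array<Insn, 6> hand_seq{{
    {EAX, EDI, false},  // movl %edi, %eax
    {EDI, EDI, true},   // rorl $6, %edi
    {EAX, EAX, true},   // rorl $11, %eax
    {EAX, EDI, true},   // xorl %edi, %eax
    {EDI, EDI, true},   // rorl $19, %edi
    {EAX, EDI, true},   // xorl %edi, %eax
}};
```

Under these assumptions critical_path returns 4 for both sequences, so the model alone does not separate them. Real latencies, port limits and mov elimination are deliberately ignored and could change the picture.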

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
                   ` (3 preceding siblings ...)
  2022-10-06  6:05 ` unlvsur at live dot com
@ 2022-10-06  6:47 ` pinskia at gcc dot gnu.org
  2022-10-07  1:41 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-10-06  6:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |DUPLICATE
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
You already filed this one.

*** This bug has been marked as a duplicate of bug 103550 ***

* [Bug rtl-optimization/107167] It looks like GCC wastes registers on trivial computations when result can be cached
  2022-10-06  5:33 [Bug rtl-optimization/107167] New: It looks like GCC wastes registers on trivial computations when result can be cached unlvsur at live dot com
                   ` (4 preceding siblings ...)
  2022-10-06  6:47 ` pinskia at gcc dot gnu.org
@ 2022-10-07  1:41 ` unlvsur at live dot com
  5 siblings, 0 replies; 7+ messages in thread
From: unlvsur at live dot com @ 2022-10-07  1:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107167

--- Comment #6 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> You already filed this one.
> 
> *** This bug has been marked as a duplicate of bug 103550 ***

(In reply to Andrew Pinski from comment #1)
> This is a reassociation, scheduling, and register allocation issue.
> 
> Plus your example might be slower due to dependencies.
> 
> Without a full example of where GCC's RA goes wrong, GCC actually produces
> much better code for this example due to register renaming in hardware.
> Note that many x86_64 CPUs also do register renaming for the stack.

I just checked uops.info: on x86_64, only two ports are available for
rotr/rotl, so they cannot really be executed in parallel.
