public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang
@ 2021-12-04  6:56 unlvsur at live dot com
  2021-12-04  6:59 ` [Bug rtl-optimization/103550] " pinskia at gcc dot gnu.org
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04  6:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

            Bug ID: 103550
           Summary: 2 more instructions generated by gcc than clang
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: unlvsur at live dot com
  Target Milestone: ---

GCC
https://godbolt.org/z/Y5W3xfeao
clang
https://godbolt.org/z/8EW6v77PP

GCC generates 2 more instructions. Why?


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
@ 2021-12-04  6:59 ` pinskia at gcc dot gnu.org
  2021-12-04  7:04 ` pinskia at gcc dot gnu.org
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-04  6:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
#include<bit>
#include<cstdint>

inline constexpr std::uint_least64_t s0(std::uint_least64_t x) noexcept
{
        return std::rotr(x,1)^std::rotr(x,8)^(x>>7);
}

inline constexpr std::uint_least64_t s1(auto x) noexcept
{
        return std::rotr(x,19)^std::rotr(x,61)^(x>>6);
}

void rd(std::uint_least64_t* x) noexcept
{
    x[0]+=s0(x[1])+s1(x[14])+x[9];
}


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
  2021-12-04  6:59 ` [Bug rtl-optimization/103550] " pinskia at gcc dot gnu.org
@ 2021-12-04  7:04 ` pinskia at gcc dot gnu.org
  2021-12-04  7:07 ` unlvsur at live dot com
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-04  7:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization, ra

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Looks like it is a register allocation/scheduling issue. The extra instructions
are mov.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
  2021-12-04  6:59 ` [Bug rtl-optimization/103550] " pinskia at gcc dot gnu.org
  2021-12-04  7:04 ` pinskia at gcc dot gnu.org
@ 2021-12-04  7:07 ` unlvsur at live dot com
  2021-12-04  7:08 ` unlvsur at live dot com
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04  7:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #3 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #2)
> Looks like it is a register allocation/scheduling issue. The extra
> instructions are mov.

Yeah, I feel GCC generally has issues with register allocation.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (2 preceding siblings ...)
  2021-12-04  7:07 ` unlvsur at live dot com
@ 2021-12-04  7:08 ` unlvsur at live dot com
  2021-12-04  7:15 ` pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04  7:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #4 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #2)
> Looks like it is a register allocation/scheduling issue. The extra
> instructions are mov.

Are there good algorithms that can allocate registers optimally?


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (3 preceding siblings ...)
  2021-12-04  7:08 ` unlvsur at live dot com
@ 2021-12-04  7:15 ` pinskia at gcc dot gnu.org
  2021-12-04  7:22 ` unlvsur at live dot com
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-04  7:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to cqwrteur from comment #4)
> (In reply to Andrew Pinski from comment #2)
> > Looks like it is a register allocation/scheduling issue. The extra
> > instructions are mov.
> 
> Are there good algorithms that can allocate registers optimally?

Note the move instructions might be "free" on most modern x86 machines; they
just take up icache space and decode time.
Having so few registers and a two-operand instruction set makes register
allocation a genuinely hard problem. Yes, LLVM might get it right in this
testcase, but there are others where GCC might do a better job.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (4 preceding siblings ...)
  2021-12-04  7:15 ` pinskia at gcc dot gnu.org
@ 2021-12-04  7:22 ` unlvsur at live dot com
  2021-12-04  7:32 ` unlvsur at live dot com
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04  7:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #6 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> (In reply to cqwrteur from comment #4)
> > (In reply to Andrew Pinski from comment #2)
> > > Looks like it is a register allocation/scheduling issue. The extra
> > > instructions are mov.
> > 
> > Are there good algorithms that can allocate registers optimally?
> 
> Note the move instructions might be "free" on most modern x86 machines;
> they just take up icache space and decode time.
> Having so few registers and a two-operand instruction set makes register
> allocation a genuinely hard problem. Yes, LLVM might get it right in this
> testcase, but there are others where GCC might do a better job.

I know. I am just investigating why compilers generate less optimal assembly
than OpenSSL for SHA-512.

https://github.com/tearosccebe/fast_io/blob/988d75ddb4af7c745df97124a6f3d1842936bfa3/include/fast_io_crypto/hash/sha512_scalar.h#L20

For one round, GCC generates 55 instructions while OpenSSL needs only 47.
The performance difference is quite noticeable, since the extra register
pressure here can add trivial loads/stores to memory for saving temporaries.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (5 preceding siblings ...)
  2021-12-04  7:22 ` unlvsur at live dot com
@ 2021-12-04  7:32 ` unlvsur at live dot com
  2021-12-04 11:30 ` roger at nextmovesoftware dot com
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04  7:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #7 from cqwrteur <unlvsur at live dot com> ---
(In reply to Andrew Pinski from comment #5)
> (In reply to cqwrteur from comment #4)
> > (In reply to Andrew Pinski from comment #2)
> > > Looks like it is a register allocation/scheduling issue. The extra
> > > instructions are mov.
> > 
> > Are there good algorithms that can allocate registers optimally?
> 
> Note the move instructions might be "free" on most modern x86 machines;
> they just take up icache space and decode time.
> Having so few registers and a two-operand instruction set makes register
> allocation a genuinely hard problem. Yes, LLVM might get it right in this
> testcase, but there are others where GCC might do a better job.

https://github.com/openssl/openssl/blob/38288f424faa0cf61bd705c497bb1a1657611da1/crypto/sha/asm/sha512-x86_64.pl#L18

OpenSSL's comments:

# 40% improvement over compiler-generated code on Opteron. On EM64T
# sha256 was observed to run >80% faster and sha512 - >40%. No magical
# tricks, just straight implementation... I really wonder why gcc
# [being armed with inline assembler] fails to generate as fast code.
# The only thing which is cool about this module is that it's very
# same instruction sequence used for both SHA-256 and SHA-512. In
# former case the instructions operate on 32-bit operands, while in
# latter - on 64-bit ones. All I had to do is to get one flavor right,
# the other one passed the test right away:-)


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (6 preceding siblings ...)
  2021-12-04  7:32 ` unlvsur at live dot com
@ 2021-12-04 11:30 ` roger at nextmovesoftware dot com
  2021-12-04 11:52 ` roger at nextmovesoftware dot com
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: roger at nextmovesoftware dot com @ 2021-12-04 11:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
I think this is a reassociation problem (rather than a register allocation
issue).
On x86-like machines, it's better to evaluate "*mem + op" as "tmp = op;
tmp += *mem", taking advantage of the available addressing modes, than as
"tmp = *mem; tmp2 = op; tmp += tmp2".

This is demonstrated by the following simplified (and over-simplified) test
case:
inline unsigned long s0(unsigned long x) {
  return (x>>8) | (x<<56);
}

void foo(unsigned long *x) {
  x[0] += s0(x[1]) + x[9];
}

void bar(unsigned long *x) {
  x[0] += s0(x[1]) + s0(x[14]) + x[9];
}

In the over-simplified foo, GCC generates optimal code, performing the
computation (rotation) first, then adding from memory, and finally adding to
memory.

foo:    movq    8(%rdi), %rax
        rorq    $8, %rax
        addq    72(%rdi), %rax
        addq    %rax, (%rdi)
        ret

which is much better than the (hypothetical) reverse order, which requires
more instructions (the extra moves observed by cqwrteur) and extra registers.

badfoo: movq    (%rdi), %rax
        addq    72(%rdi), %rax
        movq    8(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret

Alas, this is exactly what happens in more complex cases, like bar above,
which is currently compiled by mainline GCC to:

bar:    movq    8(%rdi), %rdx
        movq    (%rdi), %rax
        addq    72(%rdi), %rax
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    112(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret

The issue appears in the tree-ssa optimizers.  The 006t.gimple dump for bar
initially has the additions from memory last:

void bar (long unsigned int * x)
{
  long unsigned int D.1990;

  _1 = x + 8;
  _2 = *_1;
  _3 = s0 (_2);
  _4 = x + 112;
  _5 = *_4;
  _6 = s0 (_5);
  _7 = _3 + _6;
  _8 = x + 72;
  _9 = *_8;
  D.1990 = _7 + _9;
  _10 = *x;
  _11 = D.1990 + _10;
  *x = _11;
}

but by the time the tree-ssa optimizers have finished with it, the reads from
memory have been reassociated to the front, so 250t.optimized contains:

void bar (long unsigned int * x)
{
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  long unsigned int _9;
  long unsigned int _10;
  long unsigned int _13;
  long unsigned int _14;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(long unsigned int *)x_7(D) + 8B];
  _9 = _1 r>> 8;
  _2 = MEM[(long unsigned int *)x_7(D) + 112B];
  _10 = _2 r>> 8;
  _4 = MEM[(long unsigned int *)x_7(D) + 72B];
  _5 = *x_7(D);
  _13 = _4 + _5;
  _14 = _9 + _13;
  _6 = _10 + _14;
  *x_7(D) = _6;
  return;
}

Notice that the first addition, _4 + _5, mentions two memory references
rather than the computations _9 or _10. I believe a suitable heuristic is
that, when reassociating, memory references should come last, or at least not
first.
Hence (mem1 + mem2 + op3 + mem4 + op5) should reassociate as (op3 + op5 + mem1
+ mem2 + mem4).  And ideally, any destination memory should appear last, so
mem2 = (mem1 + mem2 + op3 + mem4 + op5) should be reassociated as mem2 = (op3 +
op5 + mem1 + mem4 + mem2).

I suspect the intent of moving memory references "early" is to reduce memory
latency, so the ideal compromise/hybrid may be to reassociate a computation
first, then memory accesses as early as possible, and finally the destination
memory last, i.e. mem2 = (op3 + mem1 + mem4 + op5 + mem2).


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (7 preceding siblings ...)
  2021-12-04 11:30 ` roger at nextmovesoftware dot com
@ 2021-12-04 11:52 ` roger at nextmovesoftware dot com
  2021-12-04 13:56 ` unlvsur at live dot com
  2022-10-06  6:47 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: roger at nextmovesoftware dot com @ 2021-12-04 11:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #9 from Roger Sayle <roger at nextmovesoftware dot com> ---
Note adding -fno-tree-reassoc results in fewer instructions than clang.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (8 preceding siblings ...)
  2021-12-04 11:52 ` roger at nextmovesoftware dot com
@ 2021-12-04 13:56 ` unlvsur at live dot com
  2022-10-06  6:47 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: unlvsur at live dot com @ 2021-12-04 13:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #10 from cqwrteur <unlvsur at live dot com> ---
(In reply to Roger Sayle from comment #9)
> Note adding -fno-tree-reassoc results in fewer instructions than clang.

Maybe I am a little bit picky, since I keep writing "portable assembly code"
in C++ and frequently compare the assembly the compiler generates against
manually written assembly. But I think that is how we get improvements into
GCC, since I do frequently find suboptimal codegen this way.

The 21 instructions look really good. I do not know whether we can squeeze it
further.

Do not get me wrong: clang actually runs much slower for my SHA-512 in
general, since it generates many more instructions (around 10% more). I just
compare GCC against OpenSSL's hand-written assembly (well, it is actually
generated by a perl script), and I feel GCC does more register moves.


* [Bug rtl-optimization/103550] 2 more instructions generated by gcc than clang
  2021-12-04  6:56 [Bug rtl-optimization/103550] New: 2 more instructions generated by gcc than clang unlvsur at live dot com
                   ` (9 preceding siblings ...)
  2021-12-04 13:56 ` unlvsur at live dot com
@ 2022-10-06  6:47 ` pinskia at gcc dot gnu.org
  10 siblings, 0 replies; 12+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-10-06  6:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 107167 has been marked as a duplicate of this bug. ***

