[Bug c++/108953] New: inefficient codegen for trivial equality

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c++/108953] New: inefficient codegen for trivial equality
@ 2023-02-27 16:29 barry.revzin at gmail dot com
  2023-02-27 16:38 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: barry.revzin at gmail dot com @ 2023-02-27 16:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

            Bug ID: 108953
           Summary: inefficient codegen for trivial equality
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: barry.revzin at gmail dot com
  Target Milestone: ---

Consider this example:

#include <cstdint>
#include <cstddef>
#include <string.h>

struct C
{
   uint8_t a;
   uint8_t b;
   uint8_t c;
   uint8_t d;
   uint16_t e;
   uint16_t f;
   int32_t g;

   bool operator==(C const&) const = default;
};

bool check(C const& lhs, C const& rhs) {
    #ifdef MEMCMP
    return memcmp(&lhs, &rhs, sizeof(lhs)) == 0;
    #else
    return lhs == rhs;
    #endif
}

There are two implementations of check here, but lead to suboptimal code.

When using MEMCMP, gcc trunk -O3 emits:

check(C const&, C const&):
        mov     rax, QWORD PTR [rsi]
        cmp     QWORD PTR [rdi], rax
        je      .L5
.L2:
        mov     eax, 1
        test    eax, eax
        sete    al
        ret
.L5:
        mov     eax, DWORD PTR [rsi+8]
        cmp     DWORD PTR [rdi+8], eax
        jne     .L2
        xor     eax, eax
        test    eax, eax
        sete    al
        ret

There's a few extra instructions here (mov eax, 1; test eax, eax; sete al;...
do we need all three of those to return 0?)

When using defaulted comparisons, gcc trunk -O3 doesn't collapse any of the
comparisons, and instead emits 7 distinct checks:

check(C const&, C const&):
        movzx   ecx, BYTE PTR [rsi]
        xor     eax, eax
        cmp     BYTE PTR [rdi], cl
        jne     .L1
        movzx   edx, BYTE PTR [rsi+1]
        cmp     BYTE PTR [rdi+1], dl
        jne     .L1
        movzx   edx, BYTE PTR [rsi+2]
        cmp     BYTE PTR [rdi+2], dl
        jne     .L1
        movzx   edx, BYTE PTR [rsi+3]
        cmp     BYTE PTR [rdi+3], dl
        jne     .L1
        movzx   edx, WORD PTR [rsi+4]
        cmp     WORD PTR [rdi+4], dx
        jne     .L1
        movzx   eax, WORD PTR [rsi+6]
        cmp     WORD PTR [rdi+6], ax
        mov     edx, DWORD PTR [rsi+8]
        sete    al
        cmp     DWORD PTR [rdi+8], edx
        sete    dl
        and     eax, edx
.L1:
        ret

Compare this to clang, which for both the memcmp and the default equality
versions emits this:

check(C const&, C const&):                        # @check(C const&, C const&)
        mov     rax, qword ptr [rdi]
        xor     rax, qword ptr [rsi]
        mov     ecx, dword ptr [rdi + 8]
        xor     ecx, dword ptr [rsi + 8]
        or      rcx, rax
        sete    al
        ret

Looks like there are two missing optimizations here for gcc: (1) the memcmp
does get optimized into an 8-byte and 4-byte comparison, but then the result of
that optimization doesn't get optimized further and (2) multiple trivial
comparisons don't get coalesced together.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug c++/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
@ 2023-02-27 16:38 ` pinskia at gcc dot gnu.org
  2023-02-27 17:11 ` [Bug tree-optimization/108953] " jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-02-27 16:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
  2023-02-27 16:38 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
@ 2023-02-27 17:11 ` jakub at gcc dot gnu.org
  2023-02-28 10:44 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-02-27 17:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c++                         |tree-optimization
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I don't think there is anything the C++ FE should do here, user can write it
like that
by hand too:
bool check(C const& lhs, C const& rhs) {
  if (lhs.a != rhs.a) return false;
  if (lhs.b != rhs.b) return false;
  if (lhs.c != rhs.c) return false;
  if (lhs.d != rhs.d) return false;
  if (lhs.e != rhs.e) return false;
  if (lhs.f != rhs.f) return false;
  if (lhs.g != rhs.g) return false;
  return true;
}
etc., so we should pattern match this somewhere during GIMPLE opts.

The memcmp issue is a RTL issue (so I think we want actually two bugs), at
GIMPLE level it is optimized into:
  _1 = __builtin_memcmp_eq (lhs_3(D), rhs_4(D), 12);
  _5 = _1 == 0;
  return _5;
which is I think what we want.

The reason for doing conditional branch after each comparison is
targetm.compare_by_pieces_branch_ratio (DImode) tells it not to batch more than
one.
Not really sure if we should override it to something else and on which
targets.

The reason for the weird
        mov     eax, 1
        test    eax, eax
        sete    al
instead of
        xor     eax, eax
and
        xor     eax, eax
        test    eax, eax
        sete    al
instead of
        mov     eax, 1
is that the epilogue is threaded only too late where the constant propagation
through it isn't possible.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
  2023-02-27 16:38 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
  2023-02-27 17:11 ` [Bug tree-optimization/108953] " jakub at gcc dot gnu.org
@ 2023-02-28 10:44 ` rguenth at gcc dot gnu.org
  2023-03-02 15:05 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-02-28 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-02-28
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think it only makes sense to combine the compares when the operand(s) are
from memory which means the machinery we have in load/store merging becomes
useful
which means we eventually want to do this in this very pass (or in the bswap
pass).

The alternative would be ifcombine but then the order of the compares in the
source could be arbitrary, so it doesn't really fit that one for the most
optimal result.

There's also a related vectorization bug in case the SLP vectorizer is the
target to look at.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
                   ` (2 preceding siblings ...)
  2023-02-28 10:44 ` rguenth at gcc dot gnu.org
@ 2023-03-02 15:05 ` jakub at gcc dot gnu.org
  2023-06-23 22:59 ` pinskia at gcc dot gnu.org
  2023-07-13 18:15 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-03-02 15:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
BTW, I've tried:
struct A
{
  unsigned char a, b, c, d;
  unsigned short e, f;
  int g;
  bool operator== (A const &) const = default;
};

struct B
{
  unsigned char b;
  bool operator== (B const &) const = default;
};

struct C
{
  unsigned char a, b, c;
  B d;
  unsigned short e, f;
  int g;
  bool operator== (C const &) const = default;
};

struct D
{
  unsigned char b;
  bool operator== (D const &x) const { return b == x.b; }
};

struct E
{
  unsigned char a, b, c;
  D d;
  unsigned short e, f;
  int g;
  bool operator== (E const &) const = default;
};

struct F
{
  unsigned char a, b, c, d;
  unsigned short e, f;
  int g;
  bool operator== (F const &x) const {
    if (a != x.a) return false;
    if (b != x.b) return false;
    if (c != x.c) return false;
    if (d != x.d) return false;
    if (e != x.e) return false;
    if (f != x.f) return false;
    return g == x.a;
  }
};

struct G
{
  unsigned char a, b, c, d;
  unsigned short e, f;
  float g;
  bool operator== (G const &) const = default;
};

bool
foo (A const &lhs, A const &rhs)
{
  return lhs == rhs;
}

bool
bar (C const &lhs, C const &rhs)
{
  return lhs == rhs;
}

bool
baz (E const &lhs, E const &rhs)
{
  return lhs == rhs;
}

bool
qux (F const &lhs, F const &rhs)
{
  return lhs == rhs;
}

bool
corge (G const &lhs, G const &rhs)
{
  return lhs == rhs;
}

to see what clang trunk does and they optimize this way only the defaulted
bool A::operator== (A const &) const, nothing else, so it clearly isn't a late
pattern matching.  Probably just they notice that the structure only contains
integral fields
and in that case emit memcmp for the portions without padding bits if certain
conditions are met (seems two consecutive char fields are vectorized, but 3
result in memcmp).
But certainly even the nested struct with also defaulted operator== isn't
optimized that way at all, even the unaffected consecutive parts there.
Just in case we'd like to do this as a late pattern matching optimization and
in the C++ FE.

Note, I wonder if reassoc wouldn't be the right spot instead of store-merging.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
                   ` (3 preceding siblings ...)
  2023-03-02 15:05 ` jakub at gcc dot gnu.org
@ 2023-06-23 22:59 ` pinskia at gcc dot gnu.org
  2023-07-13 18:15 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-06-23 22:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |madhur4127 at gmail dot com

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 110383 has been marked as a duplicate of this bug. ***

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug c++/108953] inefficient codegen for trivial equality
  2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
                   ` (4 preceding siblings ...)
  2023-06-23 22:59 ` pinskia at gcc dot gnu.org
@ 2023-07-13 18:15 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-07-13 18:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108953

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jengelh at inai dot de

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 103733 has been marked as a duplicate of this bug. ***

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-07-13 18:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27 16:29 [Bug c++/108953] New: inefficient codegen for trivial equality barry.revzin at gmail dot com
2023-02-27 16:38 ` [Bug c++/108953] " pinskia at gcc dot gnu.org
2023-02-27 17:11 ` [Bug tree-optimization/108953] " jakub at gcc dot gnu.org
2023-02-28 10:44 ` rguenth at gcc dot gnu.org
2023-03-02 15:05 ` jakub at gcc dot gnu.org
2023-06-23 22:59 ` pinskia at gcc dot gnu.org
2023-07-13 18:15 ` [Bug c++/108953] " pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).