[Bug c++/102877] New: missed optimization: memcpy produces lots more asm than otherwise

public inbox for gcc-bugs@sourceware.org

* [Bug c++/102877] New: missed optimization: memcpy produces lots more asm than otherwise
@ 2021-10-21 12:33 jengelh at inai dot de
  2021-10-21 13:03 ` [Bug middle-end/102877] " rguenth at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: jengelh at inai dot de @ 2021-10-21 12:33 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102877

            Bug ID: 102877
           Summary: missed optimization: memcpy produces lots more asm
                    than otherwise
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jengelh at inai dot de
  Target Milestone: ---

Input (C++)
===========
struct GLOBCNT { unsigned char ab[6]; };
unsigned long long gc_to_num(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
#ifdef WITH_MEMCPY
        __builtin_memcpy(v + 2, gc.ab, 6);
#else
        v[2] = gc.ab[0]; v[3] = gc.ab[1]; v[4] = gc.ab[2];
        v[5] = gc.ab[3]; v[6] = gc.ab[4]; v[7] = gc.ab[5];
#endif
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

I hope this is UB-free.
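
A minimal sketch for checking that both variants compute the same value (it is not a proof of UB-freeness, although accessing an object's bytes through unsigned char * is permitted by the aliasing rules). The names gc_to_num_memcpy, gc_to_num_bytes, gc_to_num_ref and check_variants are hypothetical and not part of the report; the reference version simply reads ab[] as a 48-bit big-endian number, which is what the function returns on either byte order.

```cpp
#include <cassert>

struct GLOBCNT { unsigned char ab[6]; };

// Variant from the report with WITH_MEMCPY defined.
inline unsigned long long gc_to_num_memcpy(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
        __builtin_memcpy(v + 2, gc.ab, 6);
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

// Variant from the report without WITH_MEMCPY.
inline unsigned long long gc_to_num_bytes(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
        v[2] = gc.ab[0]; v[3] = gc.ab[1]; v[4] = gc.ab[2];
        v[5] = gc.ab[3]; v[6] = gc.ab[4]; v[7] = gc.ab[5];
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

// Byte-order-independent reference: ab[] read as a big-endian 48-bit number.
inline unsigned long long gc_to_num_ref(GLOBCNT gc)
{
        unsigned long long r = 0;
        for (unsigned char b : gc.ab)
                r = (r << 8) | b;
        return r;
}

// Aborts if either variant disagrees with the reference.
inline void check_variants(GLOBCNT gc)
{
        assert(gc_to_num_memcpy(gc) == gc_to_num_ref(gc));
        assert(gc_to_num_bytes(gc) == gc_to_num_ref(gc));
}
```

Calling check_variants on a few inputs only exercises the result; it says nothing about the size of the generated code.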


Observed behavior
=================
With memcpy/__builtin_memcpy, the generated function is 28 instructions
long, ending with the ret at offset 0x5c.

► g++ -O2 -c t3.cpp -Wall -DWITH_MEMCPY -v
Target: x86_64-suse-linux
gcc version 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9]
(SUSE Linux)

► objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   89 f8                   mov    eax,edi
   2:   89 f9                   mov    ecx,edi
   4:   89 fa                   mov    edx,edi
   6:   44 0f b6 c7             movzx  r8d,dil
   a:   c1 e9 10                shr    ecx,0x10
   d:   0f b6 f4                movzx  esi,ah
  ...
  5c:   c3                      ret    


Expected behavior
=================
► g++ -O2 -c t3.cpp -Wall -UWITH_MEMCPY
► objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   0f b7 c7                movzx  eax,di
   3:   48 c1 ef 10             shr    rdi,0x10
   7:   48 c1 e7 20             shl    rdi,0x20
   b:   48 c1 e0 10             shl    rax,0x10
   f:   48 09 f8                or     rax,rdi
  12:   48 0f c8                bswap  rax
  15:   c3                      ret    


Other notes
===========
In a twist, clang 13.0.0 does the opposite: it produces a short version for
the memcpy case (even shorter than gcc's non-memcpy output) and a long
version for the non-memcpy case (even longer than gcc's memcpy output).

► clang++ -O2 -c t3.cpp -Wall -DWITH_MEMCPY; objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   48 89 f8                mov    rax,rdi
   3:   48 c1 e0 10             shl    rax,0x10
   7:   48 0f c8                bswap  rax
   a:   c3                      ret    

► clang++ -O2 -c t3.cpp -Wall -UWITH_MEMCPY; objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   48 89 f8                mov    rax,rdi
   3:   48 b9 ff ff ff ff ff    movabs rcx,0xffffffffffff
   a:   ff 00 00 
 ...
  6c:   c3                      ret
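
For reference, the short clang sequence (mov, shl by 0x10, bswap) corresponds to the source-level computation sketched below. This is only an illustration, assuming a little-endian host; gc_to_num_shift is a hypothetical name that does not appear in the report.

```cpp
#include <cstdint>
#include <cstring>

struct GLOBCNT { unsigned char ab[6]; };

// Assumption: little-endian host, so ab[0] ends up in the low byte of raw.
static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__,
              "sketch assumes a little-endian host");

inline std::uint64_t gc_to_num_shift(GLOBCNT gc)
{
        std::uint64_t raw = 0;
        std::memcpy(&raw, gc.ab, 6);          // low 48 bits hold ab[0..5]
        // Shifting by 16 puts two zero bytes below ab[0], which matches
        // v[0] = v[1] = 0 in the original; bswap then reverses all 8 bytes.
        return __builtin_bswap64(raw << 16);
}
```

For ab = {1, 2, 3, 4, 5, 6} the intermediate raw << 16 is 0x0605040302010000 and the result is 0x010203040506.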

* [Bug middle-end/102877] missed optimization: memcpy produces lots more asm than otherwise
  2021-10-21 12:33 [Bug c++/102877] New: missed optimization: memcpy produces lots more asm than otherwise jengelh at inai dot de
@ 2021-10-21 13:03 ` rguenth at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-10-21 13:03 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102877

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|c++                         |middle-end
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2021-10-21

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  With memcpy we expand from

  MEM <vector(2) unsigned char> [(unsigned char *)&value] = { 0, 0 };
  MEM <unsigned char[6]> [(char * {ref-all})&value + 2B] = MEM <unsigned char[6]> [(char * {ref-all})&gc];
  value.0_1 = value;
  _2 = __builtin_bswap64 (value.0_1); [tail call]
  value ={v} {CLOBBER};
  return _2;

thus we expand 'value' on the stack.  Without memcpy we manage to do

  MEM <unsigned short> [(unsigned char *)&value] = 0;
  _19 = MEM <unsigned short> [(unsigned char *)&gc];
  MEM <unsigned short> [(unsigned char *)&value + 2B] = _19;
  _21 = MEM <unsigned int> [(unsigned char *)&gc + 2B];
  MEM <unsigned int> [(unsigned char *)&value + 4B] = _21;
  value.0_7 = value;
  _8 = __builtin_bswap64 (value.0_7); [tail call]

which also expands 'value' to the stack but is apparently nicer to later
passes.  This means the way we expand the aggregate copy of type char[6]
is highly sub-optimal (we do six single-byte loads & stores).
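
To picture the difference between the two dumps: the non-memcpy path ends up as one 2-byte and one 4-byte copy, while the char[6] aggregate copy is expanded bytewise. The sketch below writes the copy in that 2 + 4 split by hand. gc_to_num_split is a hypothetical name, and whether this split actually yields the shorter code on a given GCC version has not been verified here; only the returned value is checked.

```cpp
struct GLOBCNT { unsigned char ab[6]; };

static_assert(sizeof(unsigned long long) == 8, "sketch assumes a 64-bit value");

inline unsigned long long gc_to_num_split(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        const unsigned short zero = 0;
        __builtin_memcpy(v, &zero, 2);          // bytes 0..1: zero
        __builtin_memcpy(v + 2, gc.ab, 2);      // bytes 2..3: ab[0..1]
        __builtin_memcpy(v + 4, gc.ab + 2, 4);  // bytes 4..7: ab[2..5]
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}
```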
