public inbox for gcc-bugs@sourceware.org
* [Bug c++/108506] New: bit_cast from 32-byte vector generates worse code than memcpy
@ 2023-01-23 22:01 m.cencora at gmail dot com
  2023-01-23 22:04 ` [Bug c++/108506] " m.cencora at gmail dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: m.cencora at gmail dot com @ 2023-01-23 22:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506

            Bug ID: 108506
           Summary: bit_cast from 32-byte vector generates worse code than
                    memcpy
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: m.cencora at gmail dot com
  Target Milestone: ---

GCC trunk on x86-64 produces much worse assembly for the 'deserialize' function
than for the equivalent 'deserialize2'.
The two should compile to the same code, since bit_cast should be just a
type-safe equivalent of memcpy (that is the only difference between the two funcs).

Compiled with: g++ -std=c++23 -O3 -mavx2

using v32uc = unsigned char __attribute((vector_size(32)));

constexpr auto N = 1024;

struct Foo
{
    int a[8];
};

static_assert(sizeof(Foo) == sizeof(v32uc));

void deserialize(const unsigned char* input, Foo* output)
{
    for (auto i = 0u; i != N; ++i)
    {
        v32uc vec;
        __builtin_memcpy(&vec, input, sizeof(vec));
        input += sizeof(vec);

        vec = __builtin_shuffle(vec,
            v32uc{
                3, 2, 1, 0,
                7, 6, 5, 4,
                11, 10, 9, 8,
                15, 14, 13, 12,
                19, 18, 17, 16,
                23, 22, 21, 20,
                27, 26, 25, 24,
                31, 30, 29, 28
                }
        );
        *output = __builtin_bit_cast(Foo, vec);
        output++;
    }
}

void deserialize2(const unsigned char* input, Foo* output)
{
    for (auto i = 0u; i != N; ++i)
    {
        v32uc vec;
        __builtin_memcpy(&vec, input, sizeof(vec));
        input += sizeof(vec);

        vec = __builtin_shuffle(vec,
            v32uc{
                3, 2, 1, 0,
                7, 6, 5, 4,
                11, 10, 9, 8,
                15, 14, 13, 12,
                19, 18, 17, 16,
                23, 22, 21, 20,
                27, 26, 25, 24,
                31, 30, 29, 28
                }
        );
        __builtin_memcpy(output, &vec, sizeof(vec));
        output++;
    }
}


Disassembly:

deserialize(unsigned char const*, Foo*):
  push rbp
  xor eax, eax
  mov rbp, rsp
  and rsp, -32
  vmovdqa ymm1, YMMWORD PTR .LC0[rip]
.L2:
  vmovdqu ymm3, YMMWORD PTR [rdi+rax]
  vpshufb ymm2, ymm3, ymm1
  vmovdqa YMMWORD PTR [rsp-32], ymm2
  mov rdx, QWORD PTR [rsp-32]
  mov rcx, QWORD PTR [rsp-24]
  vmovdqa xmm4, XMMWORD PTR [rsp-16]
  vmovq xmm0, rdx
  vpinsrq xmm0, xmm0, rcx, 1
  vmovdqu XMMWORD PTR [rsi+16+rax], xmm4
  vmovdqu XMMWORD PTR [rsi+rax], xmm0
  add rax, 32
  cmp rax, 32768
  jne .L2
  vzeroupper
  leave
  ret
deserialize2(unsigned char const*, Foo*):
  vmovdqa ymm1, YMMWORD PTR .LC0[rip]
  xor eax, eax
.L7:
  vmovdqu ymm2, YMMWORD PTR [rdi+rax]
  vpshufb ymm0, ymm2, ymm1
  vmovdqu YMMWORD PTR [rsi+rax], ymm0
  add rax, 32
  cmp rax, 32768
  jne .L7
  vzeroupper
  ret
.LC0:
  .byte 3
  .byte 2
  .byte 1
  .byte 0
  .byte 7
  .byte 6
  .byte 5
  .byte 4
  .byte 11
  .byte 10
  .byte 9
  .byte 8
  .byte 15
  .byte 14
  .byte 13
  .byte 12
  .byte 3
  .byte 2
  .byte 1
  .byte 0
  .byte 7
  .byte 6
  .byte 5
  .byte 4
  .byte 11
  .byte 10
  .byte 9
  .byte 8
  .byte 15
  .byte 14
  .byte 13
  .byte 12
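
A note on the .LC0 constant: it repeats 3..12 twice instead of continuing with
19, 18, ... as the source selector does. With ymm operands, vpshufb shuffles
within each 128-bit lane and uses only the low 4 bits of each control byte as
the index (a set high bit zeroes the byte). The sketch below is not part of the
original report; full_selector, is_lane_local and in_lane_control are
hypothetical names that only illustrate how the source selector maps onto the
emitted constant.

```cpp
#include <array>
#include <cstddef>

using Selector = std::array<unsigned char, 32>;

// Selector as written in the report: reverse the bytes of each 4-byte group.
inline Selector full_selector()
{
    Selector sel{};
    for (std::size_t i = 0; i < 32; ++i)
        sel[i] = static_cast<unsigned char>((i & ~std::size_t{3}) | (3 - (i & 3)));
    return sel;
}

// True when every destination byte reads from its own 128-bit lane,
// which is the case a single ymm vpshufb can handle.
inline bool is_lane_local(const Selector& sel)
{
    for (std::size_t i = 0; i < 32; ++i)
        if (static_cast<std::size_t>(sel[i]) / 16 != i / 16)
            return false;
    return true;
}

// In-lane control bytes: keep the low 4 bits of each selector entry.
inline Selector in_lane_control(const Selector& sel)
{
    Selector ctl{};
    for (std::size_t i = 0; i < 32; ++i)
        ctl[i] = static_cast<unsigned char>(sel[i] & 15);
    return ctl;
}
```

For this selector is_lane_local holds, and in_lane_control yields the same
32 bytes as the .LC0 listing.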


* [Bug c++/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: m.cencora at gmail dot com @ 2023-01-23 22:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506

--- Comment #1 from m.cencora at gmail dot com ---
"that is the only difference between the two funcs"
I mean that deserialize and deserialize2 differ only in how they perform the
store from v32uc to output (bit_cast vs. memcpy).
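
For illustration, the difference can be reduced to a single store. This is a
minimal sketch, not taken from the original report: store_bit_cast,
store_memcpy and stores_agree are hypothetical names. Both helpers should
leave identical bytes in *output, so only the generated code should differ.

```cpp
#include <cstring>

using v32uc = unsigned char __attribute__((vector_size(32)));

struct Foo
{
    int a[8];
};

static_assert(sizeof(Foo) == sizeof(v32uc), "sizes must match for a bit cast");

// Store through __builtin_bit_cast, as in 'deserialize'.
inline void store_bit_cast(Foo* output, const v32uc& vec)
{
    *output = __builtin_bit_cast(Foo, vec);
}

// Store through memcpy, as in 'deserialize2'.
inline void store_memcpy(Foo* output, const v32uc& vec)
{
    __builtin_memcpy(output, &vec, sizeof(vec));
}

// Returns true when both stores leave the same bytes in the destination.
inline bool stores_agree(const v32uc& vec)
{
    Foo x{};
    Foo y{};
    store_bit_cast(&x, vec);
    store_memcpy(&y, vec);
    return std::memcmp(&x, &y, sizeof(Foo)) == 0;
}
```

Comparing the assembly of the two store helpers at -O3 -mavx2 may be a smaller
way to observe the issue than the full loops, though that has not been checked
here.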


* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: pinskia at gcc dot gnu.org @ 2023-01-24  0:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

Internals of what is going on:

Gimple IR
bad (__builtin_bit_cast):
  MEM[(struct Foo *)output_7(D) + ivtmp.13_20 * 1] = VIEW_CONVERT_EXPR<struct Foo>(_1);

vs good (memcpy):
  MEM <vector(32) unsigned char> [(char * {ref-all})output_7(D) + ivtmp.28_20 * 1] = _1;


Both look OK, really. The first one could be rewritten into the second one,
which would fix the expansion. Alternatively, maybe it could be fixed in the
middle-end during the expansion of GIMPLE to RTL.
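
Until the expansion is fixed in the compiler, a similar rewrite can be
approximated at the source level. The following is a sketch under stated
assumptions, not something proposed in this thread: store_bits is a
hypothetical helper that keeps the size and trivially-copyable checks of
bit_cast but performs the store with memcpy, which is the form that expands
well here.

```cpp
#include <type_traits>

// Hypothetical helper (not part of GCC or of this report): a checked,
// memcpy-based store of the object representation of 'value' into '*output'.
template <typename To, typename From>
inline void store_bits(To* output, const From& value)
{
    static_assert(sizeof(To) == sizeof(From),
                  "source and destination must have the same size");
    static_assert(std::is_trivially_copyable<To>::value,
                  "destination must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value,
                  "source must be trivially copyable");
    __builtin_memcpy(output, &value, sizeof(From));
}
```

In the reproducer this would replace the bit_cast store with
store_bits(output, vec). Unlike bit_cast it is not usable in constant
expressions.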


* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: pinskia at gcc dot gnu.org @ 2023-01-24  0:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-24
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1


* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: rguenth at gcc dot gnu.org @ 2023-01-24  9:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> Confirmed.
> 
> Internals of what is going on:
> 
> Gimple IR
> bad (__builtin_bit_cast):
>   MEM[(struct Foo *)output_7(D) + ivtmp.13_20 * 1] = VIEW_CONVERT_EXPR<struct Foo>(_1);

This is an aggregate copy but the RHS is not a load - it's on the border
of invalid^Wunwanted GIMPLE.

> vs good (memcpy):
>   MEM <vector(32) unsigned char> [(char * {ref-all})output_7(D) + ivtmp.28_20 * 1] = _1;
> 
> Both look OK, really. The first one could be rewritten into the second one,
> which would fix the expansion. Alternatively, maybe it could be fixed in the
> middle-end during the expansion of GIMPLE to RTL.

In other places we said we want V_C_Es on the RHS instead of on the LHS
but here we could consume the V_C_E from the MEM_REF on the LHS since
it's also a nice type to store (beware of extended precision FP types here!).

It's already gimplification / SSA rewrite producing the problematic IL.

