public inbox for gcc-bugs@sourceware.org
* [Bug c++/108506] New: bit_cast from 32-byte vector generates worse code than memcpy
From: m.cencora at gmail dot com @ 2023-01-23 22:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506
Bug ID: 108506
Summary: bit_cast from 32-byte vector generates worse code than memcpy
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: m.cencora at gmail dot com
Target Milestone: ---
GCC trunk on x86-64 produces much worse assembly for the 'deserialize' function
than for the equivalent 'deserialize2'.
These two should be equivalent, as bit_cast should be just a type-safe
equivalent of memcpy (that is the only difference between the two funcs).
Compiled with:
g++ -std=c++23 -O3 -mavx2
using v32uc = unsigned char __attribute((vector_size(32)));

constexpr auto N = 1024;

struct Foo
{
    int a[8];
};

static_assert(sizeof(Foo) == sizeof(v32uc));

void deserialize(const unsigned char* input, Foo* output)
{
    for (auto i = 0u; i != N; ++i)
    {
        v32uc vec;
        __builtin_memcpy(&vec, input, sizeof(vec));
        input += sizeof(vec);
        vec = __builtin_shuffle(vec,
            v32uc{
                3, 2, 1, 0,
                7, 6, 5, 4,
                11, 10, 9, 8,
                15, 14, 13, 12,
                19, 18, 17, 16,
                23, 22, 21, 20,
                27, 26, 25, 24,
                31, 30, 29, 28
            }
        );
        *output = __builtin_bit_cast(Foo, vec);
        output++;
    }
}

void deserialize2(const unsigned char* input, Foo* output)
{
    for (auto i = 0u; i != N; ++i)
    {
        v32uc vec;
        __builtin_memcpy(&vec, input, sizeof(vec));
        input += sizeof(vec);
        vec = __builtin_shuffle(vec,
            v32uc{
                3, 2, 1, 0,
                7, 6, 5, 4,
                11, 10, 9, 8,
                15, 14, 13, 12,
                19, 18, 17, 16,
                23, 22, 21, 20,
                27, 26, 25, 24,
                31, 30, 29, 28
            }
        );
        __builtin_memcpy(output, &vec, sizeof(vec));
        output++;
    }
}
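The shuffle mask reverses the bytes inside each 4-byte group, so on a little-endian target such as x86-64 each loop iteration decodes eight big-endian 32-bit integers. A minimal sketch checking that reading: one iteration of deserialize2's loop body is compared against a plain scalar decoder. The names shuffle_one and decode_be32x8 are hypothetical (not from the report), and the sketch says nothing about code generation, only about what the loop computes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same vector type and struct as in the test case.
using v32uc = unsigned char __attribute__((vector_size(32)));
struct Foo { int a[8]; };
static_assert(sizeof(Foo) == sizeof(v32uc), "Foo must occupy exactly 32 bytes");

// One iteration of deserialize2's loop body: load 32 bytes, reverse the bytes
// inside every 4-byte group with __builtin_shuffle, then store with memcpy.
Foo shuffle_one(const unsigned char* input)
{
    v32uc vec;
    std::memcpy(&vec, input, sizeof(vec));
    const v32uc mask = {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12,
                        19, 18, 17, 16, 23, 22, 21, 20, 27, 26, 25, 24,
                        31, 30, 29, 28};
    vec = __builtin_shuffle(vec, mask);
    Foo out;
    std::memcpy(&out, &vec, sizeof(out));
    return out;
}

// Scalar reference: decode eight big-endian 32-bit integers.
Foo decode_be32x8(const unsigned char* input)
{
    Foo out;
    for (int i = 0; i < 8; ++i) {
        const unsigned char* p = input + 4 * i;
        const std::uint32_t v = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                                (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
        // Values above INT_MAX: implementation-defined before C++20, GCC wraps modulo 2^32.
        out.a[i] = static_cast<int>(v);
    }
    return out;
}
```

With input bytes 0, 1, 2, ..., 31, both functions yield a[0] == 0x00010203 and a[7] == 0x1C1D1E1F on a little-endian host; on a big-endian host the shuffled result would differ from the scalar decoder.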
Disassembly:
deserialize(unsigned char const*, Foo*):
        push rbp
        xor eax, eax
        mov rbp, rsp
        and rsp, -32
        vmovdqa ymm1, YMMWORD PTR .LC0[rip]
.L2:
        vmovdqu ymm3, YMMWORD PTR [rdi+rax]
        vpshufb ymm2, ymm3, ymm1
        vmovdqa YMMWORD PTR [rsp-32], ymm2
        mov rdx, QWORD PTR [rsp-32]
        mov rcx, QWORD PTR [rsp-24]
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        vmovq xmm0, rdx
        vpinsrq xmm0, xmm0, rcx, 1
        vmovdqu XMMWORD PTR [rsi+16+rax], xmm4
        vmovdqu XMMWORD PTR [rsi+rax], xmm0
        add rax, 32
        cmp rax, 32768
        jne .L2
        vzeroupper
        leave
        ret
deserialize2(unsigned char const*, Foo*):
        vmovdqa ymm1, YMMWORD PTR .LC0[rip]
        xor eax, eax
.L7:
        vmovdqu ymm2, YMMWORD PTR [rdi+rax]
        vpshufb ymm0, ymm2, ymm1
        vmovdqu YMMWORD PTR [rsi+rax], ymm0
        add rax, 32
        cmp rax, 32768
        jne .L7
        vzeroupper
        ret
.LC0:
        .byte 3
        .byte 2
        .byte 1
        .byte 0
        .byte 7
        .byte 6
        .byte 5
        .byte 4
        .byte 11
        .byte 10
        .byte 9
        .byte 8
        .byte 15
        .byte 14
        .byte 13
        .byte 12
        .byte 3
        .byte 2
        .byte 1
        .byte 0
        .byte 7
        .byte 6
        .byte 5
        .byte 4
        .byte 11
        .byte 10
        .byte 9
        .byte 8
        .byte 15
        .byte 14
        .byte 13
        .byte 12
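The .LC0 constant holds 3, 2, 1, 0 through 15, 14, 13, 12 twice instead of repeating the source mask's 19, 18, 17, 16 through 31, 30, 29, 28 for the upper half. That is because the 256-bit vpshufb selects bytes only within each 128-bit lane, using the low four bits of each control byte (a control byte with bit 7 set zeroes the output byte instead). A sketch deriving such a control vector from a lane-local __builtin_shuffle mask; vpshufb_control is a hypothetical helper, not part of GCC, and it assumes mask indices below 32.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a 32-byte __builtin_shuffle mask into the
// control bytes of an AVX2 vpshufb. Because vpshufb only picks bytes from the
// same 128-bit lane, this succeeds only if every output byte i takes its
// source byte from lane i / 16; otherwise it returns false.
bool vpshufb_control(const std::array<unsigned char, 32>& mask,
                     std::array<unsigned char, 32>& control)
{
    for (std::size_t i = 0; i < 32; ++i) {
        const std::size_t lane = i / 16;
        if (static_cast<std::size_t>(mask[i] / 16) != lane)
            return false;  // source byte lives in the other lane
        // Within a lane, only the low four bits select the byte.
        control[i] = static_cast<unsigned char>(mask[i] % 16);
    }
    return true;
}
```

Applied to the mask from the test case, it reproduces the 32 bytes listed under .LC0.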
* [Bug c++/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: m.cencora at gmail dot com @ 2023-01-23 22:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506
--- Comment #1 from m.cencora at gmail dot com ---
"that is the only difference between the two funcs"
I mean that deserialize and deserialize2 differ only in the way they store
the v32uc value to output (bit_cast vs. memcpy).
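Since the only difference is how the shuffled vector is stored, one possible source-level workaround is to keep bit_cast's compile-time checks but perform the store with memcpy, as deserialize2 does. A minimal sketch; store_bit_cast is a hypothetical name, and the assumption that GCC turns the memcpy form into a single vector store rests only on the deserialize2 disassembly in the report.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

using v32uc = unsigned char __attribute__((vector_size(32)));
struct Foo { int a[8]; };

// Hypothetical helper: copies the object representation of `src` into *dst.
// It enforces the same constraints as std::bit_cast (equal sizes, trivially
// copyable types) but writes through memcpy, the store form of deserialize2.
template <class To, class From>
void store_bit_cast(To* dst, const From& src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    std::memcpy(dst, &src, sizeof(To));
}
```

In the loop of deserialize, `*output = __builtin_bit_cast(Foo, vec);` would become `store_bit_cast(output, vec);`.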
* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: pinskia at gcc dot gnu.org @ 2023-01-24 0:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
Internals of what is going on:
Gimple IR
bad (__builtin_bit_cast):
MEM[(struct Foo *)output_7(D) + ivtmp.13_20 * 1] = VIEW_CONVERT_EXPR<struct Foo>(_1);
vs good (memcpy):
MEM <vector(32) unsigned char> [(char * {ref-all})output_7(D) + ivtmp.28_20 * 1] = _1;
Both look ok really. Though the first one could be rewritten into the second
one which would fix the expansion. Though maybe it could be fixed in the
middle-end while doing the expansion of gimple to RTL.
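The "good" GIMPLE stores a vector(32) unsigned char through a {ref-all} pointer, i.e. one that may alias any object. A hedged sketch of how such a store can be spelled in source with GCC's may_alias and aligned(1) type attributes, modelled on the unaligned __m256i_u typedef in GCC's <avxintrin.h>. The names v32uc_u and store_vector are hypothetical, and whether GCC produces exactly the "good" GIMPLE for this form has not been checked here.

```cpp
#include <cassert>
#include <cstring>

using v32uc = unsigned char __attribute__((vector_size(32)));
struct Foo { int a[8]; };

// Unaligned, may_alias variant of v32uc; <avxintrin.h> declares __m256i_u
// with the same attribute list (vector_size(32), may_alias, aligned(1)).
typedef unsigned char v32uc_u __attribute__((vector_size(32), may_alias, aligned(1)));

// Writes the 32 bytes of `vec` into *output with one vector-typed store
// through a pointer that is allowed to alias Foo.
void store_vector(Foo* output, const v32uc& vec)
{
    *reinterpret_cast<v32uc_u*>(output) = (v32uc_u)vec;
}
```

The vector is passed by reference so that the sketch also compiles without -mavx2 and without ABI warnings about 32-byte vector arguments.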
* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: pinskia at gcc dot gnu.org @ 2023-01-24 0:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-24
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
* [Bug middle-end/108506] bit_cast from 32-byte vector generates worse code than memcpy
From: rguenth at gcc dot gnu.org @ 2023-01-24 9:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108506
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> Confirmed.
>
> Internals of what is going on:
>
> Gimple IR
> bad (__builtin_bit_cast):
> MEM[(struct Foo *)output_7(D) + ivtmp.13_20 * 1] = VIEW_CONVERT_EXPR<struct Foo>(_1);
This is an aggregate copy, but the RHS is not a load; it's on the border
of invalid (or rather, unwanted) GIMPLE.
> vs good (memcpy):
> MEM <vector(32) unsigned char> [(char * {ref-all})output_7(D) + ivtmp.28_20 * 1] = _1;
>
> Both look ok really. Though the first one could be rewritten into the second
> one which would fix the expansion. Though maybe it could be fixed in the
> middle-end while doing the expansion of gimple to RTL.
In other places we said we want V_C_Es (VIEW_CONVERT_EXPRs) on the RHS
instead of on the LHS, but here the MEM_REF on the LHS could absorb the V_C_E,
since the vector type is also a nice type to store (beware of
extended-precision FP types here!).
The problematic IL is already produced at gimplification / SSA rewrite time.
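One reading of the extended-precision caveat (an interpretation; the comment does not spell it out): for such a type the value occupies fewer bytes than the object, so a load or store performed in that type need not carry every byte across, which makes it unsuitable as the type of a bit-preserving store. On x86-64 Linux, for example, long double is the x87 80-bit format held in a 16-byte object. A small sketch quantifying that; long_double_padding_bytes is a hypothetical helper that recognises only the x87-style format (64 significand digits, 10 value bytes) and, as a simplification, assumes every other format uses all of its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: bytes of a long double object that are not part of
// its value. A 64-digit significand is taken to mean the 80-bit extended
// format (10 value bytes); other formats are assumed to have no padding.
constexpr std::size_t long_double_padding_bytes()
{
    return std::numeric_limits<long double>::digits == 64 ? sizeof(long double) - 10 : 0;
}
```

On x86-64 Linux with default options this gives 6 padding bytes per long double.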