public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
@ 2023-08-26 18:29 gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-26 18:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Bug ID: 111166
Summary: gcc unnecessarily creates vector operations for
packing 32 bit integers into struct (x86_64)
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gnu_bugzilla_gcc at catelyn dot tech
Target Milestone: ---
Created attachment 55799
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55799&action=edit
preprocessed file that triggers the bug, as requested
GCC version: gcc version 13.2.1 20230801 (GCC)
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure
--enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap
--prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--with-build-config=bootstrap-lto --with-linker-hash-style=gnu
--with-system-zlib --enable-__cxa_atexit --enable-cet=auto
--enable-checking=release --enable-clocale=gnu --enable-default-pie
--enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object
--enable-libstdcxx-backtrace --enable-link-serialization=1
--enable-linker-build-id --enable-lto --enable-multilib --enable-plugin
--enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch
--disable-werror
Command used: gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s
-S -O3 -mtune=generic -march=x86-64
(same behaviour is observed with -O2)
Command gives no output to stdout nor stderr, and returns with exit code 0
When compiling the function `turn_into_struct`, a simple function that packs 4
32 bit unsigned integers arguments into a simple struct holding 4 such integers
and passes that along to `do_smth_with_4_u32`, at -O2 or -O3 the generated
assembly contains a couple vector operations (`punpckldq` and `punpcklqdq`), as
well as spilling onto the stack. This does not seem like a good idea to me,
performance wise
When compiled at -Os it instead uses `salq`, `movl` (to ensure the upper 32
bits are cleared) and `orq` to pack the data together, avoiding memory
altogether, which (intuitively to me) seems like a significantly faster
implementation as it doesn't need to touch SSE nor memory
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 7:37 ` rguenth at gcc dot gnu.org
2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28 7:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2023-08-28
Ever confirmed|0 |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. We vectorize the scalar code:
> ./cc1 -quiet t.i -O2 -fopt-info-vec -fdump-tree-slp-details
weird_gcc_behaviour.c:15:41: optimized: basic block part vectorized using 16
byte vectors
generating
uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
u32 * vectp.4;
vector(4) unsigned int * vectp.3;
struct quad_u32 D.2865;
uint64_t _11;
vector(4) unsigned int _13;
<bb 2> [local count: 1073741824]:
_13 = {a_2(D), b_4(D), c_6(D), d_8(D)};
MEM <vector(4) unsigned int> [(unsigned int *)&D.2865] = _13;
_11 = do_smth_with_4_u32 (D.2865);
D.2865 ={v} {CLOBBER(eol)};
return _11;
and
weird_gcc_behaviour.c:15:41: note: Cost model analysis:
a_2(D) 1 times scalar_store costs 12 in body
b_4(D) 1 times scalar_store costs 12 in body
c_6(D) 1 times scalar_store costs 12 in body
d_8(D) 1 times scalar_store costs 12 in body
a_2(D) 1 times vector_store costs 12 in body
node 0x5d35f70 1 times vec_construct costs 36 in prologue
weird_gcc_behaviour.c:15:41: note: Cost model analysis for part in loop 0:
Vector cost: 48
Scalar cost: 48
weird_gcc_behaviour.c:15:41: note: Basic block will be vectorized using SLP
we are choosing the vector side at same cost because we assume it would win
on code size. Practically a vector store instead of a scalar store is
also good for store forwarding.
We get
turn_into_struct:
.LFB0:
.cfi_startproc
movd %edi, %xmm1
movd %esi, %xmm4
movd %edx, %xmm0
movd %ecx, %xmm3
punpckldq %xmm4, %xmm1
punpckldq %xmm3, %xmm0
movdqa %xmm1, %xmm2
punpcklqdq %xmm0, %xmm2
movaps %xmm2, -24(%rsp)
movq -24(%rsp), %rdi
movq -16(%rsp), %rsi
jmp do_smth_with_4_u32
instead of (-fno-tree-vectorize)
turn_into_struct:
.LFB0:
.cfi_startproc
xorl %eax, %eax
movl %ecx, %r8d
movl %edi, %ecx
movl %edx, %r9d
movabsq $-4294967296, %r10
movq %rax, %rdi
xorl %edx, %edx
salq $32, %r8
andq %r10, %rdi
orq %rcx, %rdi
movq %rsi, %rcx
salq $32, %rcx
movl %edi, %esi
orq %rcx, %rsi
movq %rdx, %rcx
andq %r10, %rcx
movq %rsi, %rdi
orq %r9, %rcx
movl %ecx, %ecx
orq %r8, %rcx
movq %rcx, %rsi
jmp do_smth_with_4_u32
and our guess for code-size is correct (47 bytes for vector, 67 for scalar).
The latency for the scalar code is also quite a bit bigger. The spilling
should be OK, the store should forward nicely.
Unless you can come up with an actual benchmark showing the vector code is
slower I'd say it's not. Given it's smaller it should win on the icache
side if not executed frequently as well.
So - not a bug?
The spilling could be avoided by using movq, movhlps + movq, but it's
call handling so possibly difficult to achieve.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
@ 2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 11:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
--- Comment #2 from gnu_bugzilla_gcc at catelyn dot tech ---
Created attachment 55807
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55807&action=edit
preprocessed file containing the benchmark code I used
I compiled this code (although using includes for clock, CLOCKS_PER_SEC,
time_t, printf, and <stdint.h>) to an object and linked it with the
bug-triggering file (compiled with -Os, -O2 and -O3 to test all those options),
to measure the speed of the generated implementations of the bug-triggering
file
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 12:37 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 11:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
--- Comment #3 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not. Given it's smaller it should win on the icache
> side if not executed frequently as well.
I'm not an expert in benchmarking C, so my benchmark may be incorrect, but I
compiled the same (attached preprocessed) file with -O2, -O3, and -Os into an
object file, and then compiled a benchmarking file into an object as well (to
avoid variance caused by the benchmarking file being compiled with different
optimization levels), I added a very simple implementation for
`do_smth_with_4_u32`, and ran the `turn_into_struct` function in a hot loop,
with varying (pre-generated) input data and storing the result in an array, I
timed this hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and
end, then added up the calculated results to ensure all three programs get the
same result
on my machine (Ryzen 9 5900X) the -Os version takes ~.36s, while the -O2 and
-O3 versions take ~.43 and ~.42 seconds
I tried both -O2 and -O3 to get a slightly better view of the typical variance
between program runs, and their times are very similar, but the -Os version is
a decent amount faster (around 16%, which I'd assume is significant)
I've added the preprocessed benchmark file as well, which I then compiled with
-mtune=generic and -march=x86-64 to match the system-under-test
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
` (2 preceding siblings ...)
2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 12:37 ` rguenth at gcc dot gnu.org
2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 12:53 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28 12:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Depends on| |101926
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Your benchmark confirms the vectorized variant is slower, on a 7900X it's
both the memory roundtrip and the gpr->xmm move causing it. perf shows
| turn_into_struct():
1 | movd %edi,%xmm1
3 | movd %esi,%xmm4
4 | movd %edx,%xmm0
95 | movd %ecx,%xmm3
6 | punpckldq %xmm4,%xmm1
2 | punpckldq %xmm3,%xmm0
1 | movdqa %xmm1,%xmm2
| punpcklqdq %xmm0,%xmm2
5 | movaps %xmm2,-0x18(%rsp)
63 | mov -0x18(%rsp),%rdi
70 | mov -0x10(%rsp),%rsi
47 | jmp 400630 <do_smth_with_4_u32>
note the situation is difficult to rectify - ideally the vectorizer
would see that we require two 64bit register pieces but it doesn't - it sees
we store into memory.
I'll note the non-vectorized code is also far from optimal. clang
produces the following which is faster by more of the delta that
the vectorized version is slower compared to the scalar GCC variant.
turn_into_struct: # @turn_into_struct
.cfi_startproc
# %bb.0:
# kill: def $ecx killed $ecx def $rcx
# kill: def $esi killed $esi def $rsi
shlq $32, %rsi
movl %edi, %edi
orq %rsi, %rdi
shlq $32, %rcx
movl %edx, %esi
orq %rcx, %rsi
jmp do_smth_with_4_u32 # TAILCALL
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return should
be improved
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
` (3 preceding siblings ...)
2023-08-28 12:37 ` rguenth at gcc dot gnu.org
@ 2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 12:53 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 12:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
--- Comment #5 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #4)
> note the situation is difficult to rectify - ideally the vectorizer
> would see that we require two 64bit register pieces but it doesn't - it sees
> we store into memory.
right, I figured that might have been what was going on, given some of the
related issues, the vectorizer incorrectly calculating the cost beforehand
> I'll note the non-vectorized code is also far from optimal. clang
> produces the following which is faster by more of the delta that
> the vectorized version is slower compared to the scalar GCC variant.
I did notice that the GCC -Os and clang -O3 versions were different, didn't
realize that it was by that much, interesting
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
` (4 preceding siblings ...)
2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 12:53 ` rguenth at gcc dot gnu.org
5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28 12:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |guojiufu at gcc dot gnu.org,
| |sayle at gcc dot gnu.org
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Roger was working on TImode incoming(?) argument code generation, this is
TImode outgoing argument code generation where we produce for 32bit parts
7: NOTE_INSN_BASIC_BLOCK 2
2: r84:SI=di:SI
3: r85:SI=si:SI
4: r86:SI=dx:SI
5: r87:SI=cx:SI
6: NOTE_INSN_FUNCTION_BEG
9: r88:DI=zero_extend(r84:SI)
10: r89:DI=r82:TI#0
11: r91:DI=0xffffffff00000000
12: {r90:DI=r89:DI&r91:DI;clobber flags:CC;}
13: {r92:DI=r90:DI|r88:DI;clobber flags:CC;}
14: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r92:DI)
15: r93:DI=zero_extend(r85:SI)
16: {r94:DI=r93:DI<<0x20;clobber flags:CC;}
17: r95:DI=r82:TI#0
18: r96:DI=zero_extend(r95:DI#0)
19: {r97:DI=r96:DI|r94:DI;clobber flags:CC;}
20: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r97:DI)
21: r98:DI=zero_extend(r86:SI)
22: r99:DI=r82:TI#8
23: r101:DI=0xffffffff00000000
24: {r100:DI=r99:DI&r101:DI;clobber flags:CC;}
25: {r102:DI=r100:DI|r98:DI;clobber flags:CC;}
26: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r102:DI)<<0x40
27: r103:DI=zero_extend(r87:SI)
28: {r104:DI=r103:DI<<0x20;clobber flags:CC;}
29: r105:DI=r82:TI#8
30: r106:DI=zero_extend(r105:DI#0)
31: {r107:DI=r106:DI|r104:DI;clobber flags:CC;}
32: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r107:DI)<<0x40
33: r108:DI=r82:TI#0
34: r109:DI=r82:TI#8
35: di:DI=r108:DI
36: si:DI=r109:DI
37: ax:DI=call [`do_smth_with_4_u32'] argc:0
and we fail to dissect "backwards" from the
33: r108:DI=r82:TI#0
34: r109:DI=r82:TI#8
subregs. Possibly one issue is that we re-use r82. The dual-use of r82
at the end also poses issues as combine tries to match things like
(parallel [
(set (reg:DI 108 [ D.2865 ])
(subreg:DI (reg:TI 82 [ D.2865 ]) 0))
(set (reg:TI 82 [ D.2865 ])
(ior:TI (and:TI (reg:TI 82 [ D.2865 ])
(const_wide_int 0x0ffffffffffffffff))
(ashift:TI (zero_extend:TI (reg:DI 107))
(const_int 64 [0x40]))))
])
but fails to "rename" r82 to split the parallel.
At RTL expansion time we store to D.2865 where it's DECL_RTL is r82:TI so
we can hardly fix it there. Only a later pass could figure each of the
insns fully define the reg.
Jiufu Guo is working to improve what we choose for DECL_RTL, but for
incoming params / outgoing return. This is a case where we could,
with -fno-tree-vectorize, improve DECL_RTL for an automatic var and
choose not TImode but something like a (concat:TI reg:DI reg:DI).
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-08-28 12:53 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 12:37 ` rguenth at gcc dot gnu.org
2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
2023-08-28 12:53 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).