public inbox for gcc-bugs@sourceware.org
* [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
@ 2020-04-19 19:57 gcc at kheafield dot com
2020-04-20 8:22 ` [Bug target/94663] " jakub at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: gcc at kheafield dot com @ 2020-04-19 19:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663
Bug ID: 94663
Summary: [missed optimization] _mm512_dpbusds_epi32 generates
excess vmovdqa64
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gcc at kheafield dot com
Target Milestone: ---
The _mm512_dpbusds_epi32 intrinsic generates extra vmovdqa64 instructions when
used inside a loop. The underlying instruction, vpdpbusds, adds to an
accumulator, so it is commonly used in loops. The compiler appears to be
unnecessarily using two registers for the accumulator by copying it.
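For readers unfamiliar with the instruction, here is a minimal scalar sketch of what one 32-bit lane of vpdpbusds computes, as I understand the Intel documentation: four unsigned bytes are multiplied by four signed bytes, and the products are added to the accumulator with signed saturation. The function name DpbusdsLane is hypothetical and exists only for illustration; it is not part of any header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Scalar model of one 32-bit lane of vpdpbusds (illustrative helper, not a
// real intrinsic). a holds 4 unsigned bytes, b holds 4 signed bytes.
std::int32_t DpbusdsLane(std::int32_t acc, const std::uint8_t a[4],
                         const std::int8_t b[4]) {
  // Accumulate in 64 bits so the intermediate sum cannot overflow.
  std::int64_t sum = acc;
  for (int k = 0; k < 4; ++k) {
    sum += static_cast<std::int64_t>(a[k]) * static_cast<std::int64_t>(b[k]);
  }
  // Signed saturation to the 32-bit range.
  const std::int64_t lo = std::numeric_limits<std::int32_t>::min();
  const std::int64_t hi = std::numeric_limits<std::int32_t>::max();
  return static_cast<std::int32_t>(std::min(hi, std::max(lo, sum)));
}
```

Because the result depends on the previous accumulator value, each loop iteration reads and rewrites the same logical register, which is why an extra copy per iteration is noticeable.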
Example:
#include <immintrin.h>
#include <cstddef>
__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1,
             std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
    c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}
When compiled with g++ -O3 -mavx512vnni example.cc -S, the main loop is:
.L3:
vmovdqa64 (%rdi), %zmm6
vmovdqa64 %zmm3, %zmm0
vmovdqa64 %zmm4, %zmm2
addq $64, %rdi
vpdpbusds %zmm5, %zmm6, %zmm0
vpdpbusds %zmm1, %zmm6, %zmm2
vmovdqa64 %zmm0, %zmm3
vmovdqa64 %zmm2, %zmm4
cmpq %rdi, %rax
jne .L3
It's copying accumulator zmm3 to zmm0, accumulating in zmm0, then copying back
to zmm3. It should have just used one register. The same happens for zmm4 and
zmm2.
Workaround: use inline assembly.
__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1,
             std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    asm ("vpdpbusds %1, %2, %0" : "+x"(c0) : "mx"(a[i]), "x"(b0));
    asm ("vpdpbusds %1, %2, %0" : "+x"(c1) : "mx"(a[i]), "x"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}
Here, the generated code is better, with no extra moves.
.L10:
#APP
# 19 "example.cc" 1
vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 20 "example.cc" 1
vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
addq $64, %rdi
cmpq %rax, %rdi
jne .L10
Reproduced on the following versions of g++:
g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-9.2.0-r2/work/gcc-9.2.0/configure
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
--bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/9.2.0
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/python
--enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt
--disable-werror --with-system-zlib --enable-nls --without-included-gettext
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/
--with-pkgversion='Gentoo 9.2.0-r2 p3' --disable-esp --enable-libstdcxx-time
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64
--disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp
--disable-libmudflap --disable-libssp --disable-systemtap
--enable-vtable-verify --enable-lto --without-isl --enable-default-pie
--enable-default-ssp
Thread model: posix
gcc version 9.2.0 (Gentoo 9.2.0-r2 p3)
g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)
Full source code:
#include <immintrin.h>
#include <cstddef>
__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1,
             std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
    c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}
__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1,
             std::size_t count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    asm ("vpdpbusds %1, %2, %0" : "+x"(c0) : "mx"(a[i]), "x"(b0));
    asm ("vpdpbusds %1, %2, %0" : "+x"(c1) : "mx"(a[i]), "x"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}
Command line: g++ -O3 -mavx512vnni -S example.cc
(It also happens with g++ -O3 -march=native -S example.cc on a Cascade Lake CPU
with g++ 8.4.0).
Output: none
* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
From: jakub at gcc dot gnu.org @ 2020-04-20 8:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
Keywords| |ra
Last reconfirmed| |2020-04-20
CC| |jakub at gcc dot gnu.org,
| |vmakarov at gcc dot gnu.org
--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I bet IRA is confused by the subregs.
The loop has:
(insn 30 26 32 4 (set (reg:V16SI 112)
(unspec:V16SI [
(subreg:V16SI (reg/v:V8DI 92 [ e ]) 0)
(reg:V16SI 89 [ _25 ])
(reg:V16SI 94 [ _63 ])
] UNSPEC_VPMADDUBSWACCSSD)) "include/avx512vnniintrin.h":66:20 5895
{vpdpbusds_v16si}
(expr_list:REG_DEAD (reg/v:V8DI 92 [ e ])
(nil)))
(insn 32 30 36 4 (set (reg/v:V8DI 92 [ e ])
(subreg:V8DI (reg:V16SI 112) 0)) "include/avx512vnniintrin.h":66:10
1327 {movv8di_internal}
(nil))
(insn 36 32 38 4 (set (reg:V16SI 116)
(unspec:V16SI [
(subreg:V16SI (reg/v:V8DI 88 [ f ]) 0)
(reg:V16SI 89 [ _25 ])
(reg:V16SI 93 [ _61 ])
] UNSPEC_VPMADDUBSWACCSSD)) "include/avx512vnniintrin.h":66:20 5895
{vpdpbusds_v16si}
(expr_list:REG_DEAD (reg:V16SI 89 [ _25 ])
(expr_list:REG_DEAD (reg/v:V8DI 88 [ f ])
(nil))))
(insn 38 36 39 4 (set (reg/v:V8DI 88 [ f ])
(subreg:V8DI (reg:V16SI 116) 0)) "include/avx512vnniintrin.h":66:10
1327 {movv8di_internal}
(nil))
These are the only instructions that refer to pseudos 112, 92, 116 and 88, and
the constraints on the vpdpbusds_v16si insn are "=v" "0" "v" "vm". The best
outcome would be for IRA to assign the same hard register to pseudos 112 and
92, and another one to 116 and 88. Instead it assigns a different hard
register to each.
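To make the cost of that allocation concrete, here is a toy sketch (not GCC code) of how the tied "0" constraint turns a split allocation into moves. The struct AccumulateStep, the helper ExtraMoves and the hard register numbers used with it are all hypothetical; only the pseudo numbers come from the dump above. The assumption is that when the input and output pseudos of one accumulate step get different hard registers, the loop pays one copy to satisfy the tie and one to copy the result back.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model of one "out = vpdpbusds(in, ...)" insn followed by "in = out".
struct AccumulateStep {
  int in_pseudo;   // accumulator input, tied to the output by constraint "0"
  int out_pseudo;  // result pseudo, constraint "=v"
};

// Counts register-to-register moves left in the loop body for a given
// pseudo -> hard register assignment (illustrative only).
int ExtraMoves(const std::vector<AccumulateStep>& steps,
               const std::map<int, int>& hard_reg) {
  int moves = 0;
  for (const AccumulateStep& s : steps) {
    if (hard_reg.at(s.in_pseudo) != hard_reg.at(s.out_pseudo)) {
      moves += 2;  // copy into the tied register, then copy the result back
    }
  }
  return moves;
}
```

With pseudos 92/112 and 88/116 in four distinct registers this model reports four moves, matching the four register-to-register vmovdqa64 instructions in the reported loop; with each pair sharing a register it reports none.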
* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
From: vmakarov at gcc dot gnu.org @ 2020-04-27 15:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663
--- Comment #2 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> I bet IRA is confused by the subregs.
>
No, I don't think it is the case here.
(insn 19 18 20 4 (parallel [
(set (reg:DI 102)
(ashift:DI (reg/v:DI 86 [ i ])
(const_int 6 [0x6])))
(clobber (reg:CC 17 flags))
]) "b.cc":8:30 592 {*ashldi3_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 20 19 24 4 (set (reg:V16SI 87 [ _21 ])
(mem:V16SI (plus:DI (reg/v/f:DI 98 [ a ])
(reg:DI 102)) [0 MEM[base: a_14(D), index: _43, offset: 0B]+0
S64 A512])) "/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc\
-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:51 1324 {movv16si_internal}
(expr_list:REG_DEAD (reg:DI 102)
(nil)))
(insn 24 20 26 4 (set (reg:V16SI 103)
(unspec:V16SI [
(subreg:V16SI (reg/v:V8DI 90 [ c0 ]) 0)
(reg:V16SI 87 [ _21 ])
(reg:V16SI 92 [ _40 ])
] UNSPEC_VPMADDUBSWACCSSD))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:5\
1 5895 {vpdpbusds_v16si}
(expr_list:REG_DEAD (reg/v:V8DI 90 [ c0 ])
(nil)))
---->>>>>>(insn 26 24 30 4 (set (reg/v:V8DI 90 [ c0 ])
(subreg:V8DI (reg:V16SI 103) 0))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":67:\
22 1327 {movv8di_internal}
(nil))
(insn 30 26 32 4 (set (reg:V16SI 107)
(unspec:V16SI [
(subreg:V16SI (reg/v:V8DI 83 [ c1 ]) 0)
(reg:V16SI 87 [ _21 ])
(reg:V16SI 91 [ _39 ])
] UNSPEC_VPMADDUBSWACCSSD))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:5\
1 5895 {vpdpbusds_v16si}
(expr_list:REG_DEAD (reg:V16SI 87 [ _21 ])
(expr_list:REG_DEAD (reg/v:V8DI 83 [ c1 ])
(nil))))
(insn 32 30 33 4 (set (reg/v:V8DI 83 [ c1 ])
(subreg:V8DI (reg:V16SI 107) 0))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":67:\
22 1327 {movv8di_internal}
(nil))
(insn 33 32 35 4 (parallel [
(set (reg/v:DI 86 [ i ])
(plus:DI (reg/v:DI 86 [ i ])
(const_int 1 [0x1])))
(clobber (reg:CC 17 flags))
]) "b.cc":7:3 186 {*adddi_1}
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 35 33 36 4 (set (reg:CCZ 17 flags)
(compare:CCZ (reg/v:DI 101 [ count ])
(reg/v:DI 86 [ i ]))) "b.cc":7:29 12 {*cmpdi_1}
(nil))
(jump_insn 36 35 37 4 (set (pc)
(if_then_else (ne (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref:DI 34)
(pc))) "b.cc":7:29 736 {*jcc}
(expr_list:REG_DEAD (reg:CCZ 17 flags)
(int_list:REG_BR_PROB 955630228 (nil)))
-> 34)
;; succ: 5 [11.0% (guessed)] count:105119324 (estimated locally)
(FALLTHRU,LOOP_EXIT)
;; 4 [89.0% (guessed)] count:850510901 (estimated locally)
(DFS_BACK)
;; lr out 6 [bp] 7 [sp] 16 [argp] 19 [frame] 83 86 90 91 92 98 101 103
107
;; live out 6 [bp] 7 [sp] 16 [argp] 19 [frame] 83 86 **90** 91 92 98 101
**103** 107
For example, 103 and 90 conflict. LRA can recognize the simple case of pseudos
that hold the same value and are live simultaneously, but in this case both
pseudos are live outside the BB.
GVN for RA conflict calculation could help here, but it would complicate an
already complicated RA. I tried GVN for the old RA about 15 years ago (there
is an article about this in the GCC summit proceedings) but never tried to put
this work into GCC because of mixed results.
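As a rough illustration of the difference, here is a toy sketch (not how IRA is implemented) that contrasts a plain liveness-based conflict test with a value-aware one. The function names and the value_class map are hypothetical; the real allocator computes conflicts over live ranges at program points, and value classes like these are what a GVN pass would have to supply.

```cpp
#include <cassert>
#include <map>
#include <set>

// Plain rule: two pseudos conflict if both are live at the same point.
bool LiveTogether(const std::set<int>& live, int a, int b) {
  return live.count(a) != 0 && live.count(b) != 0;
}

// Value-aware rule: pseudos known to hold the same value do not conflict,
// even when both are live. value_class maps pseudo -> value number.
bool ConflictsValueAware(const std::set<int>& live,
                         const std::map<int, int>& value_class, int a, int b) {
  if (!LiveTogether(live, a, b)) return false;
  const auto ia = value_class.find(a);
  const auto ib = value_class.find(b);
  const bool same_value = ia != value_class.end() &&
                          ib != value_class.end() && ia->second == ib->second;
  return !same_value;
}
```

Under the plain rule, 90 and 103 conflict because both appear in the live-out set above; under the value-aware rule they would be allowed to share a hard register.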
The best fix is to avoid generating such code, but I don't know whether that
is possible in this case.
* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
From: jakub at gcc dot gnu.org @ 2020-04-27 15:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Vladimir Makarov from comment #2)
> The best fix is to avoid generating such code, but I don't know whether that
> is possible in this case.
I'm afraid that is hard, because such vector type/mode casts are inherent in
the Intel intrinsic APIs. For floating point vectors they have separate types
for vectors of floats and for vectors of doubles, but for vectors of integral
types they have just a single type, __m{128,256,512}i, which has (somewhat
randomly) picked one particular element type. That is the mode of the user
variables that use the Intel APIs, and each operation that needs to work in a
different vector mode casts to that mode and back.
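The pattern can be sketched in portable C++ without any intrinsics. Model512i, ModelV16si and AddLanes32 below are hypothetical stand-ins, not the real types: Model512i plays the role of __m512i (eight 64-bit elements, the V8DI mode in the dumps) and ModelV16si plays the role of the internal sixteen 32-bit element view (V16SI). The casts are modeled with memcpy, assuming both structs are exactly 64 bytes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for __m512i: the user-visible type has 64-bit elements.
struct Model512i { std::array<std::int64_t, 8> q; };
// Stand-in for the 32-bit lane view used inside the operation.
struct ModelV16si { std::array<std::int32_t, 16> d; };
static_assert(sizeof(Model512i) == sizeof(ModelV16si), "same 64 bytes");

// Bit-preserving "casts" between the two views.
ModelV16si ToV16si(const Model512i& v) {
  ModelV16si r;
  std::memcpy(&r, &v, sizeof r);
  return r;
}
Model512i ToM512i(const ModelV16si& v) {
  Model512i r;
  std::memcpy(&r, &v, sizeof r);
  return r;
}

// A 32-bit lane operation on the single integer type: cast in, operate,
// cast back. The accumulator variable itself stays in the 64-bit view.
Model512i AddLanes32(const Model512i& acc, std::int32_t x) {
  ModelV16si lanes = ToV16si(acc);
  for (std::int32_t& lane : lanes.d) lane += x;
  return ToM512i(lanes);
}
```

In a loop such as c = AddLanes32(c, 1), the variable c always has the 64-bit element type while the work happens in the 32-bit view, which mirrors the subreg round trip between the V8DI and V16SI pseudos in the RTL.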