public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
@ 2020-04-19 19:57 gcc at kheafield dot com
  2020-04-20  8:22 ` [Bug target/94663] " jakub at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: gcc at kheafield dot com @ 2020-04-19 19:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663

            Bug ID: 94663
           Summary: [missed optimization] _mm512_dpbusds_epi32 generates
                    excess vmovdqa64
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gcc at kheafield dot com
  Target Milestone: ---

The _mm512_dpbusds_epi32 intrinsic generates extra vmovdqa64 instructions when
used inside a loop.  The underlying instruction, vpdpbusds, adds to an
accumulator, so it is commonly used in loops.  The compiler appears to be
unnecessarily using two registers for the accumulator by copying it.  

Example:

#include "immintrin.h"
__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
    c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

When compiled with g++ -O3 -mavx512vnni example.cc -S, the main loop is:

.L3:
        vmovdqa64       (%rdi), %zmm6
        vmovdqa64       %zmm3, %zmm0
        vmovdqa64       %zmm4, %zmm2
        addq    $64, %rdi
        vpdpbusds       %zmm5, %zmm6, %zmm0
        vpdpbusds       %zmm1, %zmm6, %zmm2
        vmovdqa64       %zmm0, %zmm3
        vmovdqa64       %zmm2, %zmm4
        cmpq    %rdi, %rax
        jne     .L3

It's copying accumulator zmm3 to zmm0, accumulating in zmm0, then copying back
to zmm3.  It should have just used one register.  The same happens for zmm4 and
zmm2.  

Workaround: use inline assembly.  

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    asm ("vpdpbusds %1, %2, %0" : "+x"(c0) : "mx"(a[i]), "x"(b0));
    asm ("vpdpbusds %1, %2, %0" : "+x"(c1) : "mx"(a[i]), "x"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

Here, the generated code is better, with no extra moves.  

.L10:
#APP
# 19 "example.cc" 1
        vpdpbusds (%rdi), %zmm3, %zmm0
# 0 "" 2
# 20 "example.cc" 1
        vpdpbusds (%rdi), %zmm1, %zmm2
# 0 "" 2
#NO_APP
        addq    $64, %rdi
        cmpq    %rax, %rdi
        jne     .L10

Reproduced on the following versions of g++:

g++ -v 
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/9.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-9.2.0-r2/work/gcc-9.2.0/configure
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr
--bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/9.2.0
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/9.2.0/python
--enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt
--disable-werror --with-system-zlib --enable-nls --without-included-gettext
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/
--with-pkgversion='Gentoo 9.2.0-r2 p3' --disable-esp --enable-libstdcxx-time
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64
--disable-altivec --disable-fixed-point --enable-targets=all --enable-libgomp
--disable-libmudflap --disable-libssp --disable-systemtap
--enable-vtable-verify --enable-lto --without-isl --enable-default-pie
--enable-default-ssp
Thread model: posix
gcc version 9.2.0 (Gentoo 9.2.0-r2 p3) 

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
8.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr
--with-gcc-major-version-only --program-suffix=-8
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-libmpx
--enable-plugin --enable-default-pie --with-system-zlib
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04) 

Full source code:
#include <immintrin.h>
#include <cstddef>

__m512i Slow(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    c0 = _mm512_dpbusds_epi32(c0, a[i], b0);
    c1 = _mm512_dpbusds_epi32(c1, a[i], b1);
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

__m512i Fast(const __m512i *a, const __m512i b0, const __m512i b1, std::size_t
count) {
  __m512i c0 = _mm512_setzero_epi32();
  __m512i c1 = _mm512_setzero_epi32();
  for (std::size_t i = 0; i < count; ++i) {
    asm ("vpdpbusds %1, %2, %0" : "+x"(c0) : "mx"(a[i]), "x"(b0));
    asm ("vpdpbusds %1, %2, %0" : "+x"(c1) : "mx"(a[i]), "x"(b1));
  }
  // Do not optimize away
  return _mm512_sub_epi32(c0, c1);
}

Command line: g++ -O3 -mavx512vnni -S example.cc
(It also happens with g++ -O3 -march=native -S example.cc on a Cascade Lake CPU
with g++ 8.4.0).  
Output: none

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
  2020-04-19 19:57 [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64 gcc at kheafield dot com
@ 2020-04-20  8:22 ` jakub at gcc dot gnu.org
  2020-04-27 15:05 ` vmakarov at gcc dot gnu.org
  2020-04-27 15:12 ` jakub at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-20  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
           Keywords|                            |ra
   Last reconfirmed|                            |2020-04-20
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I bet IRA is confused by the subregs.
The loop has:
(insn 30 26 32 4 (set (reg:V16SI 112)
        (unspec:V16SI [
                (subreg:V16SI (reg/v:V8DI 92 [ e ]) 0)
                (reg:V16SI 89 [ _25 ])
                (reg:V16SI 94 [ _63 ])
            ] UNSPEC_VPMADDUBSWACCSSD)) "include/avx512vnniintrin.h":66:20 5895
{vpdpbusds_v16si}
     (expr_list:REG_DEAD (reg/v:V8DI 92 [ e ])
        (nil)))
(insn 32 30 36 4 (set (reg/v:V8DI 92 [ e ])
        (subreg:V8DI (reg:V16SI 112) 0)) "include/avx512vnniintrin.h":66:10
1327 {movv8di_internal}
     (nil))
(insn 36 32 38 4 (set (reg:V16SI 116)
        (unspec:V16SI [
                (subreg:V16SI (reg/v:V8DI 88 [ f ]) 0)
                (reg:V16SI 89 [ _25 ])
                (reg:V16SI 93 [ _61 ])
            ] UNSPEC_VPMADDUBSWACCSSD)) "include/avx512vnniintrin.h":66:20 5895
{vpdpbusds_v16si}
     (expr_list:REG_DEAD (reg:V16SI 89 [ _25 ])
        (expr_list:REG_DEAD (reg/v:V8DI 88 [ f ])
            (nil))))
(insn 38 36 39 4 (set (reg/v:V8DI 88 [ f ])
        (subreg:V8DI (reg:V16SI 116) 0)) "include/avx512vnniintrin.h":66:10
1327 {movv8di_internal}
     (nil))
as the only instructions that refer pseudos 112, 92, 116, 88 and the
constraints on the vpdpbusds_v16si
insn are "=v" "0" "v" "vm", so best would be if IRA assigns the same hard
register to pseudos 112 and 92
and another one to 116 and 88.  But it actually assigns a different hard
register for each.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
  2020-04-19 19:57 [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64 gcc at kheafield dot com
  2020-04-20  8:22 ` [Bug target/94663] " jakub at gcc dot gnu.org
@ 2020-04-27 15:05 ` vmakarov at gcc dot gnu.org
  2020-04-27 15:12 ` jakub at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2020-04-27 15:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663

--- Comment #2 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> I bet IRA is confused by the subregs.
> 
No, I don't think it is the case here.

(insn 19 18 20 4 (parallel [                                                    
            (set (reg:DI 102)                                                   
                (ashift:DI (reg/v:DI 86 [ i ])                                  
                    (const_int 6 [0x6])))                                       
            (clobber (reg:CC 17 flags))                                         
        ]) "b.cc":8:30 592 {*ashldi3_1}                                         
     (expr_list:REG_UNUSED (reg:CC 17 flags)                                    
        (nil)))                                                                 
(insn 20 19 24 4 (set (reg:V16SI 87 [ _21 ])                                    
        (mem:V16SI (plus:DI (reg/v/f:DI 98 [ a ])                               
                (reg:DI 102)) [0 MEM[base: a_14(D), index: _43, offset: 0B]+0
S64 A512])) "/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc\
-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:51 1324 {movv16si_internal}    
     (expr_list:REG_DEAD (reg:DI 102)                                           
        (nil)))                                                                 
(insn 24 20 26 4 (set (reg:V16SI 103)                                           
        (unspec:V16SI [                                                         
                (subreg:V16SI (reg/v:V8DI 90 [ c0 ]) 0)                         
                (reg:V16SI 87 [ _21 ])                                          
                (reg:V16SI 92 [ _40 ])                                          
            ] UNSPEC_VPMADDUBSWACCSSD))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:5\
1 5895 {vpdpbusds_v16si}                                                        
     (expr_list:REG_DEAD (reg/v:V8DI 90 [ c0 ])                                 
        (nil)))                                                                 
---->>>>>>(insn 26 24 30 4 (set (reg/v:V8DI 90 [ c0 ])                          
        (subreg:V8DI (reg:V16SI 103) 0))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":67:\
22 1327 {movv8di_internal}                                                      
     (nil))                                                                     
(insn 30 26 32 4 (set (reg:V16SI 107)                                           
        (unspec:V16SI [                                                         
                (subreg:V16SI (reg/v:V8DI 83 [ c1 ]) 0)                         
                (reg:V16SI 87 [ _21 ])                                          
                (reg:V16SI 91 [ _39 ])                                          
            ] UNSPEC_VPMADDUBSWACCSSD))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":66:5\
1 5895 {vpdpbusds_v16si}                                                        
     (expr_list:REG_DEAD (reg:V16SI 87 [ _21 ])                                 
        (expr_list:REG_DEAD (reg/v:V8DI 83 [ c1 ])                              
            (nil))))                                                            
(insn 32 30 33 4 (set (reg/v:V8DI 83 [ c1 ])                                    
        (subreg:V8DI (reg:V16SI 107) 0))
"/home/vmakarov/build1/gcc-git/64/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avx512vnniintrin.h":67:\
22 1327 {movv8di_internal}                                                      
     (nil))                                                                     
(insn 33 32 35 4 (parallel [                                                    
            (set (reg/v:DI 86 [ i ])                                            
                (plus:DI (reg/v:DI 86 [ i ])                                    
                    (const_int 1 [0x1])))                                       
            (clobber (reg:CC 17 flags))                                         
        ]) "b.cc":7:3 186 {*adddi_1}                                            
     (expr_list:REG_UNUSED (reg:CC 17 flags)                                    
        (nil)))                                                                 
(insn 35 33 36 4 (set (reg:CCZ 17 flags)                                        
        (compare:CCZ (reg/v:DI 101 [ count ])                                   
            (reg/v:DI 86 [ i ]))) "b.cc":7:29 12 {*cmpdi_1}                     
     (nil))                                                                     
(jump_insn 36 35 37 4 (set (pc)                                                 
        (if_then_else (ne (reg:CCZ 17 flags)                                    
                (const_int 0 [0]))                                              
            (label_ref:DI 34)                                                   
            (pc))) "b.cc":7:29 736 {*jcc}                                       
     (expr_list:REG_DEAD (reg:CCZ 17 flags)                                     
        (int_list:REG_BR_PROB 955630228 (nil)))                                 
 -> 34)                                                                         
;;  succ:       5 [11.0% (guessed)]  count:105119324 (estimated locally)
(FALLTHRU,LOOP_EXIT)                                                 
;;              4 [89.0% (guessed)]  count:850510901 (estimated locally)
(DFS_BACK)                                                           
;; lr  out       6 [bp] 7 [sp] 16 [argp] 19 [frame] 83 86 90 91 92 98 101 103
107                                                             
;; live  out     6 [bp] 7 [sp] 16 [argp] 19 [frame] 83 86 **90** 91 92 98 101
**103** 107  

For example, 103 and 90 conflict.  LRA can recognize the simple case of pseudos
using the same value and living simulteneously but in this case the both
pseudos live outside BB.

GVN for RA conflict calculation could help it here but it will complicate
already complicated RA.  I tried GVN for the old RA about 15 years ago (there
is some article about this from GCC summit proceedings) but never tried to put
this work into GCC because of mixed results.

The best way to fix is to avoid to generate such code.  But I don't know is it
possible for this case.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug target/94663] [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64
  2020-04-19 19:57 [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64 gcc at kheafield dot com
  2020-04-20  8:22 ` [Bug target/94663] " jakub at gcc dot gnu.org
  2020-04-27 15:05 ` vmakarov at gcc dot gnu.org
@ 2020-04-27 15:12 ` jakub at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-27 15:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Vladimir Makarov from comment #2)
> The best way to fix is to avoid to generate such code.  But I don't know is
> it possible for this case.

I'm afraid that is hard, because the Intel intrinsic APIs have such permanent
vector type/mode casts inherent in its APIs, while for floating point vectors
they have separate types for vectors of floats and for vectors of doubles, for
vectors of integral types they have just a single type like __m{128,256,512}i
which has just (somewhat randomly) picked one particular element type; and that
is the mode of the user vars that use the Intel APIs, and then on each
operation that needs to work with different vector mode there is cast to that
and back.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-04-27 15:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-19 19:57 [Bug target/94663] New: [missed optimization] _mm512_dpbusds_epi32 generates excess vmovdqa64 gcc at kheafield dot com
2020-04-20  8:22 ` [Bug target/94663] " jakub at gcc dot gnu.org
2020-04-27 15:05 ` vmakarov at gcc dot gnu.org
2020-04-27 15:12 ` jakub at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).