public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
@ 2023-08-26 18:29 gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28  7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-26 18:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

            Bug ID: 111166
           Summary: gcc unnecessarily creates vector operations for
                    packing 32 bit integers into struct (x86_64)
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gnu_bugzilla_gcc at catelyn dot tech
  Target Milestone: ---

Created attachment 55799
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55799&action=edit
preprocessed file that triggers the bug, as requested

GCC version: gcc version 13.2.1 20230801 (GCC)

Target: x86_64-pc-linux-gnu

Configured with: /build/gcc/src/gcc/configure
--enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap
--prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--with-build-config=bootstrap-lto --with-linker-hash-style=gnu
--with-system-zlib --enable-__cxa_atexit --enable-cet=auto
--enable-checking=release --enable-clocale=gnu --enable-default-pie
--enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object
--enable-libstdcxx-backtrace --enable-link-serialization=1
--enable-linker-build-id --enable-lto --enable-multilib --enable-plugin
--enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch
--disable-werror

Command used: gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s
-S -O3 -mtune=generic -march=x86-64
(same behaviour is observed with -O2)

The command gives no output on stdout or stderr and returns with exit code 0

When compiling the function `turn_into_struct`, a simple function that packs 4
32-bit unsigned integer arguments into a struct holding 4 such integers and
passes it along to `do_smth_with_4_u32`, the assembly generated at -O2 or -O3
contains a couple of vector operations (`punpckldq` and `punpcklqdq`) as well
as a spill onto the stack. This does not seem like a good idea to me,
performance-wise.

When compiled at -Os it instead uses `salq`, `movl` (to ensure the upper 32
bits are cleared) and `orq` to pack the data together, avoiding memory
altogether, which (intuitively, to me) seems like a significantly faster
implementation as it doesn't need to touch SSE or memory.
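
For reference, here is a minimal sketch of the kind of code being discussed;
the actual attached reproducer may differ in details such as the exact type
and field names, which are assumptions here:

#include <stdint.h>

typedef unsigned int u32;

struct quad_u32 {
    u32 a, b, c, d;
};

/* Defined in another translation unit; takes the struct by value. */
extern uint64_t do_smth_with_4_u32 (struct quad_u32 q);

/* Packs the four 32-bit arguments into the struct and forwards it. */
uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
    struct quad_u32 q = { a, b, c, d };
    return do_smth_with_4_u32 (q);
}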


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28  7:37 ` rguenth at gcc dot gnu.org
  2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28  7:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-08-28
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We vectorize the scalar code:

> ./cc1 -quiet t.i -O2 -fopt-info-vec -fdump-tree-slp-details
weird_gcc_behaviour.c:15:41: optimized: basic block part vectorized using 16
byte vectors

generating

uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d)
{
  u32 * vectp.4;
  vector(4) unsigned int * vectp.3;
  struct quad_u32 D.2865;
  uint64_t _11;
  vector(4) unsigned int _13;

  <bb 2> [local count: 1073741824]:
  _13 = {a_2(D), b_4(D), c_6(D), d_8(D)};
  MEM <vector(4) unsigned int> [(unsigned int *)&D.2865] = _13;
  _11 = do_smth_with_4_u32 (D.2865);
  D.2865 ={v} {CLOBBER(eol)};
  return _11;

and

weird_gcc_behaviour.c:15:41: note: Cost model analysis:
a_2(D) 1 times scalar_store costs 12 in body
b_4(D) 1 times scalar_store costs 12 in body
c_6(D) 1 times scalar_store costs 12 in body
d_8(D) 1 times scalar_store costs 12 in body
a_2(D) 1 times vector_store costs 12 in body
node 0x5d35f70 1 times vec_construct costs 36 in prologue
weird_gcc_behaviour.c:15:41: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 48
weird_gcc_behaviour.c:15:41: note: Basic block will be vectorized using SLP

We are choosing the vector side at the same cost because we assume it will win
on code size.  In practice a vector store instead of a scalar store is
also good for store forwarding.

We get

turn_into_struct:
.LFB0:
        .cfi_startproc
        movd    %edi, %xmm1
        movd    %esi, %xmm4
        movd    %edx, %xmm0
        movd    %ecx, %xmm3
        punpckldq       %xmm4, %xmm1
        punpckldq       %xmm3, %xmm0
        movdqa  %xmm1, %xmm2
        punpcklqdq      %xmm0, %xmm2
        movaps  %xmm2, -24(%rsp)
        movq    -24(%rsp), %rdi
        movq    -16(%rsp), %rsi
        jmp     do_smth_with_4_u32

instead of (-fno-tree-vectorize)

turn_into_struct:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        movl    %ecx, %r8d
        movl    %edi, %ecx
        movl    %edx, %r9d
        movabsq $-4294967296, %r10
        movq    %rax, %rdi
        xorl    %edx, %edx
        salq    $32, %r8
        andq    %r10, %rdi
        orq     %rcx, %rdi
        movq    %rsi, %rcx
        salq    $32, %rcx
        movl    %edi, %esi
        orq     %rcx, %rsi
        movq    %rdx, %rcx
        andq    %r10, %rcx
        movq    %rsi, %rdi
        orq     %r9, %rcx
        movl    %ecx, %ecx
        orq     %r8, %rcx
        movq    %rcx, %rsi
        jmp     do_smth_with_4_u32

and our guess for code size is correct (47 bytes for vector, 67 for scalar).
The latency of the scalar code is also quite a bit higher.  The spilling
should be OK; the store should forward nicely.

Unless you can come up with an actual benchmark showing the vector code is
slower, I'd say it's not.  Given it's smaller, it should also win on the
icache side if it is not executed frequently.

So - not a bug?

The spilling could be avoided by using movq, movhlps + movq, but this is
call handling, so it is possibly difficult to achieve.


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28  7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
@ 2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

--- Comment #2 from gnu_bugzilla_gcc at catelyn dot tech ---
Created attachment 55807
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55807&action=edit
preprocessed file containing the benchmark code I used

I compiled this code (although using includes for clock, CLOCKS_PER_SEC,
time_t, printf, and <stdint.h>) to an object file and linked it with the
bug-triggering file (compiled with -Os, -O2, and -O3 to test all those
options) in order to measure the speed of the generated implementations of
the bug-triggering file.


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28  7:37 ` [Bug target/111166] " rguenth at gcc dot gnu.org
  2023-08-28 11:49 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28 12:37 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 11:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

--- Comment #3 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #1)
> Unless you can come up with an actual benchmark showing the vector code is
> slower I'd say it's not.  Given it's smaller it should win on the icache
> side if not executed frequently as well.

I'm not an expert in benchmarking C, so my benchmark may be incorrect, but I
compiled the same (attached preprocessed) file with -O2, -O3, and -Os into an
object file, and then compiled a benchmarking file into an object as well (to
avoid variance caused by the benchmarking file being compiled with different
optimization levels).  I added a very simple implementation for
`do_smth_with_4_u32` and ran the `turn_into_struct` function in a hot loop,
with varying (pre-generated) input data, storing the results in an array.  I
timed this hot loop using `(float)clock()/CLOCKS_PER_SEC;` at the start and
end, then added up the calculated results to ensure all three programs get the
same result.
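
A minimal sketch of what that benchmark looks like follows; the actual
attached file may differ, and the iteration counts, the input generation, and
the trivial `do_smth_with_4_u32` body below are only placeholders:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint32_t u32;

struct quad_u32 { u32 a, b, c, d; };

/* turn_into_struct lives in the separately compiled bug-triggering object. */
extern uint64_t turn_into_struct (u32 a, u32 b, u32 c, u32 d);

/* A very simple implementation of the callee, defined in the benchmark
   object so it is identical for all three builds under test. */
uint64_t do_smth_with_4_u32 (struct quad_u32 q)
{
  return (uint64_t) q.a + q.b + q.c + q.d;
}

#define N (1 << 16)
#define ITERS 10000

static u32 input[N][4];
static uint64_t results[N];

int main (void)
{
  /* Pre-generate varying input data. */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < 4; j++)
      input[i][j] = (u32) (i * 2654435761u + j);

  float start = (float) clock () / CLOCKS_PER_SEC;
  for (int k = 0; k < ITERS; k++)          /* the hot loop being timed */
    for (int i = 0; i < N; i++)
      results[i] = turn_into_struct (input[i][0], input[i][1],
                                     input[i][2], input[i][3]);
  float end = (float) clock () / CLOCKS_PER_SEC;

  /* Add up the results so all three builds can be checked for agreement. */
  uint64_t sum = 0;
  for (int i = 0; i < N; i++)
    sum += results[i];

  printf ("%.2fs, checksum %llu\n", end - start, (unsigned long long) sum);
  return 0;
}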

On my machine (Ryzen 9 5900X) the -Os version takes ~0.36 seconds, while the
-O2 and -O3 versions take ~0.43 and ~0.42 seconds respectively.

I tried both -O2 and -O3 to get a slightly better view of the typical variance
between program runs; their times are very similar, but the -Os version is a
decent amount faster (around 16%, which I'd assume is significant).

I've attached the preprocessed benchmark file as well, which I compiled with
-mtune=generic and -march=x86-64 to match the system under test.


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
                   ` (2 preceding siblings ...)
  2023-08-28 11:52 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 12:37 ` rguenth at gcc dot gnu.org
  2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28 12:53 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28 12:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |101926

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Your benchmark confirms the vectorized variant is slower; on a 7900X it's
both the memory round-trip and the gpr->xmm move causing it.  perf shows
       |    turn_into_struct():
     1 |      movd       %edi,%xmm1
     3 |      movd       %esi,%xmm4
     4 |      movd       %edx,%xmm0
    95 |      movd       %ecx,%xmm3
     6 |      punpckldq  %xmm4,%xmm1
     2 |      punpckldq  %xmm3,%xmm0
     1 |      movdqa     %xmm1,%xmm2
       |      punpcklqdq %xmm0,%xmm2
     5 |      movaps     %xmm2,-0x18(%rsp)
    63 |      mov        -0x18(%rsp),%rdi
    70 |      mov        -0x10(%rsp),%rsi
    47 |      jmp        400630 <do_smth_with_4_u32>

note the situation is difficult to rectify - ideally the vectorizer
would see that we require two 64bit register pieces but it doesn't - it sees
we store into memory.

I'll note the non-vectorized code is also far from optimal.  clang produces
the following, which is faster by more than the delta by which the vectorized
version is slower than the scalar GCC variant.

turn_into_struct:                       # @turn_into_struct
        .cfi_startproc
# %bb.0:
                                        # kill: def $ecx killed $ecx def $rcx
                                        # kill: def $esi killed $esi def $rsi
        shlq    $32, %rsi
        movl    %edi, %edi
        orq     %rsi, %rdi
        shlq    $32, %rcx
        movl    %edx, %esi
        orq     %rcx, %rsi
        jmp     do_smth_with_4_u32      # TAILCALL
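
In C terms, the clang sequence above roughly corresponds to packing the four
values into the two 64-bit GPRs in which the 16-byte struct is passed (a
sketch for illustration only, with the hypothetical helper name
pack_for_call; the field order a, b, c, d is taken from the dumps above):

#include <stdint.h>

typedef uint32_t u32;

/* Build the two eightbytes (%rdi and %rsi) that carry the struct. */
static void pack_for_call (u32 a, u32 b, u32 c, u32 d,
                           uint64_t *lo, uint64_t *hi)
{
  *lo = ((uint64_t) b << 32) | a;   /* first eightbyte: fields a and b */
  *hi = ((uint64_t) d << 32) | c;   /* second eightbyte: fields c and d */
}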


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return should
be improved


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
                   ` (3 preceding siblings ...)
  2023-08-28 12:37 ` rguenth at gcc dot gnu.org
@ 2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
  2023-08-28 12:53 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: gnu_bugzilla_gcc at catelyn dot tech @ 2023-08-28 12:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

--- Comment #5 from gnu_bugzilla_gcc at catelyn dot tech ---
(In reply to Richard Biener from comment #4)
> note the situation is difficult to rectify - ideally the vectorizer
> would see that we require two 64bit register pieces but it doesn't - it sees
> we store into memory.

Right, I figured that might be what was going on, given some of the related
issues: the vectorizer incorrectly calculating the cost beforehand.

> I'll note the non-vectorized code is also far from optimal.  clang produces
> the following, which is faster by more than the delta by which the vectorized
> version is slower than the scalar GCC variant.

I did notice that the GCC -Os and clang -O3 versions were different; I didn't
realize the difference was that large.  Interesting.


* [Bug target/111166] gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64)
  2023-08-26 18:29 [Bug tree-optimization/111166] New: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) gnu_bugzilla_gcc at catelyn dot tech
                   ` (4 preceding siblings ...)
  2023-08-28 12:51 ` gnu_bugzilla_gcc at catelyn dot tech
@ 2023-08-28 12:53 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-28 12:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org,
                   |                            |sayle at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Roger was working on TImode incoming(?) argument code generation; this is
TImode outgoing argument code generation, where for 32-bit parts we produce

    7: NOTE_INSN_BASIC_BLOCK 2
    2: r84:SI=di:SI
    3: r85:SI=si:SI
    4: r86:SI=dx:SI
    5: r87:SI=cx:SI
    6: NOTE_INSN_FUNCTION_BEG
    9: r88:DI=zero_extend(r84:SI)
   10: r89:DI=r82:TI#0
   11: r91:DI=0xffffffff00000000
   12: {r90:DI=r89:DI&r91:DI;clobber flags:CC;}
   13: {r92:DI=r90:DI|r88:DI;clobber flags:CC;}
   14: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r92:DI)
   15: r93:DI=zero_extend(r85:SI)
   16: {r94:DI=r93:DI<<0x20;clobber flags:CC;}
   17: r95:DI=r82:TI#0
   18: r96:DI=zero_extend(r95:DI#0)
   19: {r97:DI=r96:DI|r94:DI;clobber flags:CC;}
   20: r82:TI=r82:TI&<0xffffffffffffffff,0>|zero_extend(r97:DI)
   21: r98:DI=zero_extend(r86:SI)
   22: r99:DI=r82:TI#8
   23: r101:DI=0xffffffff00000000
   24: {r100:DI=r99:DI&r101:DI;clobber flags:CC;}
   25: {r102:DI=r100:DI|r98:DI;clobber flags:CC;}
   26: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r102:DI)<<0x40
   27: r103:DI=zero_extend(r87:SI)
   28: {r104:DI=r103:DI<<0x20;clobber flags:CC;}
   29: r105:DI=r82:TI#8
   30: r106:DI=zero_extend(r105:DI#0)
   31: {r107:DI=r106:DI|r104:DI;clobber flags:CC;}
   32: r82:TI=r82:TI&<0,0xffffffffffffffff>|zero_extend(r107:DI)<<0x40
   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8
   35: di:DI=r108:DI
   36: si:DI=r109:DI
   37: ax:DI=call [`do_smth_with_4_u32'] argc:0

and we fail to dissect "backwards" from the

   33: r108:DI=r82:TI#0
   34: r109:DI=r82:TI#8

subregs.  Possibly one issue is that we re-use r82.  The dual use of r82
at the end also poses issues, as combine tries to match things like

(parallel [ 
        (set (reg:DI 108 [ D.2865 ])
            (subreg:DI (reg:TI 82 [ D.2865 ]) 0))
        (set (reg:TI 82 [ D.2865 ])
            (ior:TI (and:TI (reg:TI 82 [ D.2865 ])
                    (const_wide_int 0x0ffffffffffffffff))
                (ashift:TI (zero_extend:TI (reg:DI 107))
                    (const_int 64 [0x40]))))
    ])      

but fails to "rename" r82 to split the parallel.

At RTL expansion time we store to D.2865, whose DECL_RTL is r82:TI, so
we can hardly fix it there.  Only a later pass could figure out that each of
the insns fully defines the reg.

Jiufu Guo is working to improve what we choose for DECL_RTL, but for
incoming params / outgoing return.  This is a case where we could,
with -fno-tree-vectorize, improve DECL_RTL for an automatic var and
choose not TImode but something like a (concat:TI reg:DI reg:DI).

