[Bug c/116274] New: x86: poor code generation with 16 byte function arguments

public inbox for gcc-bugs@sourceware.org

* [Bug c/116274] New: x86: poor code generation with 16 byte function arguments
@ 2024-08-07 18:08 ripatel at wii dot dev
  2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: ripatel at wii dot dev @ 2024-08-07 18:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

            Bug ID: 116274
           Summary: x86: poor code generation with 16 byte function
                    arguments
           Product: gcc
           Version: 14.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ripatel at wii dot dev
  Target Milestone: ---

The following program:

struct a { long x,y; };
long test(struct a a) { return a.x+a.y; }

compiled with

$ gcc -c -o test.o -march=x86-64-v2 -O3 test.c

results in 15 x86_64 instructions using xmm registers under the System V
calling convention, when two (lea, ret) should suffice.

$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <test>:
   0:   0f 29 4c 24 e8          movaps %xmm1,-0x18(%rsp)
   5:   48 8b 54 24 f0          mov    -0x10(%rsp),%rdx
   a:   66 48 0f 6e cf          movq   %rdi,%xmm1
   f:   66 48 0f 6e de          movq   %rsi,%xmm3
  14:   66 48 0f 3a 22 ca 01    pinsrq $0x1,%rdx,%xmm1
  1b:   66 0f 6c cb             punpcklqdq %xmm3,%xmm1
  1f:   0f 29 4c 24 e8          movaps %xmm1,-0x18(%rsp)
  24:   48 8b 44 24 f0          mov    -0x10(%rsp),%rax
  29:   66 0f 6f d1             movdqa %xmm1,%xmm2
  2d:   66 48 0f 3a 22 d0 01    pinsrq $0x1,%rax,%xmm2
  34:   66 0f 6f c2             movdqa %xmm2,%xmm0
  38:   66 0f 73 d8 08          psrldq $0x8,%xmm0
  3d:   66 0f d4 c1             paddq  %xmm1,%xmm0
  41:   66 48 0f 7e c0          movq   %xmm0,%rax
  46:   c3                      ret
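
For comparison, here is a minimal sketch of the scalar form the report expects. It assumes an LP64 target using the System V x86-64 ABI, where a 16-byte struct of two longs arrives in %rdi and %rsi. The names `test_scalar` and `check_equivalent` are hypothetical and exist only for illustration.

```c
#include <assert.h>

struct a { long x, y; };

/* The function from the report. */
long test(struct a a) { return a.x + a.y; }

/* Hypothetical equivalent with the two members passed as separate
   arguments.  Under the assumed ABI both versions receive their data
   in %rdi and %rsi, which is why a single lea plus ret is expected. */
long test_scalar(long x, long y) { return x + y; }

/* Illustrative self-check: both forms must compute the same value. */
void check_equivalent(long x, long y)
{
    struct a v = { x, y };
    assert(test(v) == test_scalar(x, y));
}
```

Compiling both functions with the command line from the report and comparing the disassembly is one way to see the gap between the vectorized and the scalar form.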

Debug information:

$ gcc -v -save-temps -c -o test.o -march=x86-64-v2 -O3 test.c
Using built-in specs.
COLLECT_GCC=gcc
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,m2,lto --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --enable-libstdcxx-backtrace
--with-libstdcxx-zoneinfo=/usr/share/zoneinfo --with-linker-hash-style=gnu
--enable-plugin --enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-14.2.1-20240801/obj-x86_64-redhat-linux/isl-install
--enable-offload-targets=nvptx-none,amdgcn-amdhsa --enable-offload-defaulted
--without-cuda-driver --enable-gnu-indirect-function --enable-cet
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
--with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.2.1 20240801 (Red Hat 14.2.1-1) (GCC) 
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
 /usr/libexec/gcc/x86_64-redhat-linux/14/cc1 -E -quiet -v test.c
-march=x86-64-v2 -O3 -fpch-preprocess -o test.i
ignoring nonexistent directory
"/usr/lib/gcc/x86_64-redhat-linux/14/include-fixed"
ignoring nonexistent directory
"/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-redhat-linux/14/include
 /usr/local/include
 /usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
 /usr/libexec/gcc/x86_64-redhat-linux/14/cc1 -fpreprocessed test.i -quiet
-dumpbase test.c -dumpbase-ext .c -march=x86-64-v2 -O3 -version -o test.s
GNU C17 (GCC) version 14.2.1 20240801 (Red Hat 14.2.1-1) (x86_64-redhat-linux)
        compiled by GNU C version 14.2.1 20240801 (Red Hat 14.2.1-1), GMP
version 6.2.1, MPFR version 4.2.1, MPC version 1.3.1, isl version isl-0.24-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 7983ab47815232989bed61515b77d1c7
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
 as -v --64 -o test.o test.s
GNU assembler version 2.41 (x86_64-redhat-linux) using BFD version version
2.41-37.fc40
COMPILER_PATH=/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/
LIBRARY_PATH=/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3' '-dumpdir' 'test.'

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
@ 2024-08-07 18:11 ` pinskia at gcc dot gnu.org
  2024-08-08  9:07 ` [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-08-07 18:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |13.4
            Summary|x86: poor code generation   |[14/15 Regression] x86:
                   |with 16 byte function       |poor code generation with
                   |arguments                   |16 byte function arguments
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-08-07
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed. This comes from vectorizing and then doing a reduction add.

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
  2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
@ 2024-08-08  9:07 ` rguenth at gcc dot gnu.org
  2024-08-08  9:52 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-08  9:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sayle at gcc dot gnu.org

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.c:2:35: note: Cost model analysis: 
_1 + _2 1 times scalar_stmt costs 4 in body
a.x 1 times scalar_load costs 12 in body
a.y 1 times scalar_load costs 12 in body 
a.x 1 times unaligned_load (misalign -1) costs 12 in body
_1 + _2 1 times vector_stmt costs 4 in body
_1 + _2 1 times vec_perm costs 4 in body 
_1 + _2 1 times vec_to_scalar costs 4 in body
_1 + _2 0 times scalar_stmt costs 0 in body
t.c:2:35: note: Cost model analysis for part in loop 0:
  Vector cost: 24
  Scalar cost: 28
t.c:2:35: note: Basic block will be vectorized using SLP

The vectorizer costing does not know that a.y and a.x are readily available
in registers, so the cost of 24 charged for the two scalar loads doesn't
actually exist.
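
The arithmetic behind that remark can be sketched as follows. The per-statement costs are copied from the dump above; `scalar_cost_without_loads` is a hypothetical figure for the case where the two loads are treated as free register reads, not something the vectorizer computes.

```c
#include <assert.h>

/* Per-statement costs as printed in the cost model dump. */
enum {
    SCALAR_STMT    = 4,
    SCALAR_LOAD    = 12,
    UNALIGNED_LOAD = 12,
    VECTOR_STMT    = 4,
    VEC_PERM       = 4,
    VEC_TO_SCALAR  = 4
};

/* Scalar side as costed: one add plus two loads. */
int scalar_cost_as_dumped(void)
{
    return SCALAR_STMT + 2 * SCALAR_LOAD;
}

/* Vector side as costed: one vector load, add, permute and extract. */
int vector_cost_as_dumped(void)
{
    return UNALIGNED_LOAD + VECTOR_STMT + VEC_PERM + VEC_TO_SCALAR;
}

/* Hypothetical: scalar cost if the two loads cost nothing because the
   values are already in registers. */
int scalar_cost_without_loads(void)
{
    return SCALAR_STMT;
}

/* The totals must match the "Scalar cost" and "Vector cost" lines. */
void check_dump_totals(void)
{
    assert(scalar_cost_as_dumped() == 28);
    assert(vector_cost_as_dumped() == 24);
}
```

With the loads counted, the vector side wins 24 to 28. Without them the scalar side would cost 4, far below the vector cost, and vectorization would not look profitable.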

On the vector side there's the issue that we spill.  We are expanding from

  vect__1.5_5 = MEM <vector(2) long int> [(long int *)&a];
  _6 = VIEW_CONVERT_EXPR<vector(2) unsigned long>(vect__1.5_5);
  _7 = .REDUC_PLUS (_6); [tail call]
  _8 = (long int) _7;
  return _8;

;; _7 = .REDUC_PLUS (_6); [tail call]

(insn 10 9 11 (set (reg:V1TI 108)
        (lshiftrt:V1TI (subreg:V1TI (reg/v:TI 102 [ a ]) 0)
            (const_int 64 [0x40]))) -1
     (nil))

(insn 11 10 12 (set (reg:V2DI 107)
        (subreg:V2DI (reg:V1TI 108) 0)) -1
     (nil))

(insn 12 11 13 (set (reg:V2DI 106)
        (plus:V2DI (reg:V2DI 107)
            (subreg:V2DI (reg/v:TI 102 [ a ]) 0))) -1
     (nil))

(insn 13 12 0 (set (reg:DI 100 [ _7 ])
        (vec_select:DI (reg:V2DI 106)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

that's not unreasonable.  Note we set up TI 102 like

(insn 2 8 3 2 (set (reg:DI 104)
        (reg:DI 5 di [ a ])) "t.c":2:23 -1
     (nil))
(insn 3 2 4 2 (set (reg:DI 105)
        (reg:DI 4 si [ a+8 ])) "t.c":2:23 -1
     (nil))
(insn 4 3 5 2 (set (reg:TI 103)
        (zero_extend:TI (reg:DI 104))) "t.c":2:23 -1
     (nil))
(insn 5 4 6 2 (set (reg:TI 103)
        (ior:TI (and:TI (reg:TI 103)
                (const_wide_int 0x0ffffffffffffffff))
            (ashift:TI (zero_extend:TI (reg:DI 105))
                (const_int 64 [0x40])))) "t.c":2:23 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 102 [ a ])
        (reg:TI 103)) "t.c":2:23 -1
     (nil))

and the task is to "recover" from the back-and-forth.  Unfortunately
combine fails:

Trying 5, 10 -> 12:
    5: r103:TI=zero_extend(r111:DI)<<0x40|zero_extend(r110:DI)
      REG_DEAD r111:DI
      REG_DEAD r110:DI
   10: r108:V1TI=r103:TI#0 0>>0x40
   12: r106:V2DI=r108:V1TI#0+r103:TI#0
      REG_DEAD r108:V1TI
      REG_DEAD r103:TI
Failed to match this instruction: 
(set (reg:V2DI 106)
    (plus:V2DI (subreg:V2DI (lshiftrt:V1TI (subreg:V1TI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                            (const_int 64 [0x40]))
                        (zero_extend:TI (reg:DI 110))) 0)
                (const_int 64 [0x40])) 0)
        (subreg:V2DI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
                    (const_int 64 [0x40]))
                (zero_extend:TI (reg:DI 110))) 0)))
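
As a rough illustration of what this RTL computes, the following sketch models the sequence in plain C. It is not GCC code: the type `v2di` and all function names are hypothetical, and a 128-bit value is modelled as two 64-bit lanes with lane 0 holding the low half.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 128-bit value as two 64-bit lanes, lane 0 = low half. */
typedef struct { uint64_t lane[2]; } v2di;

/* Insns 2-6: build the TImode pseudo from the two incoming registers. */
v2di concat_di(uint64_t lo, uint64_t hi)
{
    v2di r = { { lo, hi } };
    return r;
}

/* Insn 10: logical right shift of the whole 128-bit value by 64. */
v2di lshiftrt_64(v2di v)
{
    v2di r = { { v.lane[1], 0 } };
    return r;
}

/* Insn 12: lane-wise 64-bit addition. */
v2di plus_v2di(v2di a, v2di b)
{
    v2di r = { { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] } };
    return r;
}

/* Insn 13: select lane 0. */
uint64_t vec_select_0(v2di v) { return v.lane[0]; }

/* The whole sequence, starting from the two argument registers. */
uint64_t reduc_plus_model(uint64_t di, uint64_t si)
{
    v2di a = concat_di(di, si);
    return vec_select_0(plus_v2di(lshiftrt_64(a), a));
}

/* The round trip through the vector form is just a 64-bit add. */
void check_model(uint64_t x, uint64_t y)
{
    assert(reduc_plus_model(x, y) == x + y);
}
```

The model makes the back-and-forth visible: lane 0 of the result depends only on the two incoming 64-bit values, so the concatenate, shift and select steps cancel out.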

It isn't clear why we end up spilling, why STV2 doesn't help in the end, or
why exactly neither combine nor late-combine nor forwprop helps.

Of course the vectorizer costing is off here - load/store cost is dominating
it in general and I've mentioned decreasing the load/store costing compared
to the arithmetic stmt costing.

Still I would expect RTL optimizations to recover from this failure and
resurrect the scalar add of the incoming register arguments.

Roger is very good at analyzing this stuff, so CCing him.

The regression is because the target now exposes the two-lane V2DImode
reduc_plus pattern (if that were fed by a much larger sequence of
vectorizable arithmetic it should be a win).

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
  2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
  2024-08-08  9:07 ` [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition rguenth at gcc dot gnu.org
@ 2024-08-08  9:52 ` rguenth at gcc dot gnu.org
  2024-08-12  8:19 ` liuhongt at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-08  9:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
struct a { int x,y,z,w; };
int test(struct a a) { return a.x+a.y+a.z+a.w; }

behaves similarly.

I do have a patch for the vectorizer costing that avoids vectorizing in
these cases.  We will still vectorize

struct a { short a0,a1,a2,a3,a4,a5,a6,a7; };
short test(struct a a) { return a.a0+a.a1+a.a2+a.a3+a.a4+a.a5+a.a6+a.a7; }

generating

test:
.LFB0:
        .cfi_startproc
        movaps  %xmm1, -24(%rsp)
        movq    -16(%rsp), %rdx
        movq    %rdi, %xmm1
        movq    %rsi, %xmm3
        pinsrq  $1, %rdx, %xmm1
        punpcklqdq      %xmm3, %xmm1
        movaps  %xmm1, -24(%rsp)
        movdqa  %xmm1, %xmm2
        pinsrq  $1, -16(%rsp), %xmm2
        movdqa  %xmm2, %xmm0
        psrldq  $8, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $4, %xmm1
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $2, %xmm1
        paddw   %xmm1, %xmm0
        pextrw  $0, %xmm0, %eax
        ret

as opposed to

test:
.LFB0:
        .cfi_startproc
        movl    %edi, %eax
        movq    %rdi, %rdx
        sarl    $16, %eax
        salq    $16, %rdx
        addl    %edi, %eax
        sarq    $48, %rdx
        addl    %edx, %eax
        sarq    $48, %rdi
        movl    %esi, %edx
        addl    %edi, %eax
        sarl    $16, %edx
        addl    %esi, %eax
        addl    %edx, %eax
        movq    %rsi, %rdx
        sarq    $48, %rsi
        salq    $16, %rdx
        sarq    $48, %rdx
        addl    %edx, %eax
        addl    %esi, %eax
        ret

it still has the odd (dead)

        movaps  %xmm1, -24(%rsp)
        movq    -16(%rsp), %rdx

The

        movaps  %xmm1, -24(%rsp)
        movdqa  %xmm1, %xmm2
        pinsrq  $1, -16(%rsp), %xmm2

codegen is probably an RA/LRA artifact caused by bad instruction constraints
and the refusal to reload to a GPR.  Not sure if a move high to GPR is a thing;
pextrq would work for sure.  But an unpck looks like a better match anyway.
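
The psrldq/paddw sequence in the vectorized short example is a log-step horizontal reduction. A minimal C model of it, assuming eight 16-bit lanes with lane 0 lowest, might look like the sketch below. All function names are hypothetical, and the final conversion back to int16_t relies on the usual wrap-around behaviour of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Model of psrldq by 2*k bytes: move every lane down by k positions
   and fill the vacated upper lanes with zeros. */
void shift_right_lanes(const uint16_t in[LANES], unsigned k,
                       uint16_t out[LANES])
{
    for (unsigned i = 0; i < LANES; i++)
        out[i] = (i + k < LANES) ? in[i + k] : 0;
}

/* Model of paddw: lane-wise 16-bit addition with wrap-around. */
void paddw(uint16_t acc[LANES], const uint16_t b[LANES])
{
    for (unsigned i = 0; i < LANES; i++)
        acc[i] = (uint16_t)(acc[i] + b[i]);
}

/* Log-step reduction: shift by 4, 2 and 1 lanes (8, 4 and 2 bytes),
   adding after each shift, then read lane 0 (the pextrw $0 step). */
int16_t reduce_add_w(const int16_t a[LANES])
{
    uint16_t v[LANES], t[LANES];
    for (unsigned i = 0; i < LANES; i++)
        v[i] = (uint16_t)a[i];
    for (unsigned k = LANES / 2; k >= 1; k /= 2) {
        shift_right_lanes(v, k, t);
        paddw(v, t);
    }
    return (int16_t)v[0];
}

/* Scalar reference: add in int, truncate on return, as the C source does. */
int16_t reduce_add_scalar(const int16_t a[LANES])
{
    int sum = 0;
    for (unsigned i = 0; i < LANES; i++)
        sum += a[i];
    return (int16_t)sum;
}

void check_reduction(const int16_t a[LANES])
{
    assert(reduce_add_w(a) == reduce_add_scalar(a));
}
```

Three shift-and-add steps replace the seven scalar additions, which is why this case can still be considered profitable once the spills are gone.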

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (2 preceding siblings ...)
  2024-08-08  9:52 ` rguenth at gcc dot gnu.org
@ 2024-08-12  8:19 ` liuhongt at gcc dot gnu.org
  2024-08-12  9:41 ` liuhongt at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12  8:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
With the patch below, compiling with -march=x86-64-v3 -O3, the redundant spills are gone.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f044826269c..e8bcf314752 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20292,6 +20292,10 @@ inline_secondary_memory_needed (machine_mode mode, reg_class_t class1,
       if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
        return true;

+      /* *movti_internal supports movement between SSE_REGS and GENERAL_REGS.  */
+      if (mode == TImode)
+       return false;
+
       int msize = GET_MODE_SIZE (mode);

       /* Between SSE and general, we have moves no larger than word size.  */


struct aq { long x,y; };
long testq(struct aq a) { return a.x+a.y; }

struct aw { short a0,a1,a2,a3,a4,a5,a6,a7; };
short testw(struct aw a) { return a.a0+a.a1+a.a2+a.a3+a.a4+a.a5+a.a6+a.a7; }

struct ad { int x,y,z,w; };
int testd(struct ad a) { return a.x+a.y+a.z+a.w; }

testq:
.LFB0:
        .cfi_startproc
        vmovq   %rdi, %xmm1
        vpinsrq $1, %rsi, %xmm1, %xmm1
        vpsrldq $8, %xmm1, %xmm0
        vpaddq  %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        ret
        .cfi_endproc
.LFE0:
        .size   testq, .-testq
        .p2align 4
        .globl  testw
        .type   testw, @function
testw:
.LFB1:
        .cfi_startproc
        vmovq   %rdi, %xmm1
        vpinsrq $1, %rsi, %xmm1, %xmm1
        vpsrldq $8, %xmm1, %xmm0
        vpaddw  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddw  %xmm1, %xmm0, %xmm0
        vpsrldq $2, %xmm0, %xmm1
        vpaddw  %xmm1, %xmm0, %xmm0
        vpextrw $0, %xmm0, %eax
        ret
        .cfi_endproc
.LFE1:
        .size   testw, .-testw
        .p2align 4
        .globl  testd
        .type   testd, @function
testd:
.LFB2:
        .cfi_startproc
        vmovq   %rdi, %xmm1
        vpinsrq $1, %rsi, %xmm1, %xmm1
        vpsrldq $8, %xmm1, %xmm0
        vpaddd  %xmm1, %xmm0, %xmm0
        vpsrldq $4, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax
        ret
        .cfi_endproc

But with -march=x86-64-v2 or -march=x86-64 -O3, the spills are still there,
hmm.

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (3 preceding siblings ...)
  2024-08-12  8:19 ` liuhongt at gcc dot gnu.org
@ 2024-08-12  9:41 ` liuhongt at gcc dot gnu.org
  2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12  9:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #5 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
For the non-AVX case, it looks like it hits this code:

  /* Special case TImode to 128-bit vector conversions via V2DI.  */
  if (VECTOR_MODE_P (mode)
      && GET_MODE_SIZE (mode) == 16
      && SUBREG_P (op1)
      && GET_MODE (SUBREG_REG (op1)) == TImode
      && TARGET_64BIT && TARGET_SSE
      && can_create_pseudo_p ())
    {
      rtx tmp = gen_reg_rtx (V2DImode);
      rtx lo = gen_reg_rtx (DImode);
      rtx hi = gen_reg_rtx (DImode);
      emit_move_insn (lo, gen_lowpart (DImode, SUBREG_REG (op1)));
      emit_move_insn (hi, gen_highpart (DImode, SUBREG_REG (op1)));
      emit_insn (gen_vec_concatv2di (tmp, lo, hi));
      emit_move_insn (op0, gen_lowpart (mode, tmp));
      return;
    }
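
In plain C terms, the special case splits a 128-bit integer into its low and high 64-bit halves and reassembles them as a two-lane vector. The sketch below only models that data flow and is not the GCC implementation. It assumes a compiler that provides `unsigned __int128` (GCC and Clang do on x86-64), and the types `ti_t` and `v2di_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: unsigned __int128 is available as a compiler extension. */
typedef unsigned __int128 ti_t;
typedef struct { uint64_t lane[2]; } v2di_t;

/* Models taking the low and high 64-bit parts of the 128-bit value. */
uint64_t lowpart_di(ti_t v)  { return (uint64_t)v; }
uint64_t highpart_di(ti_t v) { return (uint64_t)(v >> 64); }

/* Models the concatenation: lane 0 = lo, lane 1 = hi. */
v2di_t vec_concat_v2di(uint64_t lo, uint64_t hi)
{
    v2di_t r = { { lo, hi } };
    return r;
}

/* The whole conversion: split into halves, then concatenate. */
v2di_t ti_to_v2di(ti_t v)
{
    return vec_concat_v2di(lowpart_di(v), highpart_di(v));
}

/* The conversion loses no bits: recombining the lanes gives v back. */
void check_split(ti_t v)
{
    v2di_t r = ti_to_v2di(v);
    assert((((ti_t)r.lane[1] << 64) | r.lane[0]) == v);
}
```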

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (4 preceding siblings ...)
  2024-08-12  9:41 ` liuhongt at gcc dot gnu.org
@ 2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
  2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12 10:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #5)
> For non-avx case, looks like it hits here
> 
>   748  /* Special case TImode to 128-bit vector conversions via V2DI.  */   
> 


Preventing that special case during reload, we get

        .file   "test.c"
        .text
        .p2align 4
        .globl  testq
        .type   testq, @function
testq:
.LFB0:
        .cfi_startproc
        movq    %rdi, %xmm1
        pinsrq  $1, %rsi, %xmm1
        movdqa  %xmm1, %xmm0
        psrldq  $8, %xmm0
        paddq   %xmm1, %xmm0
        movq    %xmm0, %rax
        ret
        .cfi_endproc
.LFE0:
        .size   testq, .-testq
        .p2align 4
        .globl  testw
        .type   testw, @function
testw:
.LFB1:
        .cfi_startproc
        movq    %rdi, %xmm1
        pinsrq  $1, %rsi, %xmm1
        movdqa  %xmm1, %xmm0
        psrldq  $8, %xmm0
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $4, %xmm1
        paddw   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $2, %xmm1
        paddw   %xmm1, %xmm0
        pextrw  $0, %xmm0, %eax
        ret
        .cfi_endproc
.LFE1:
        .size   testw, .-testw
        .p2align 4
        .globl  testd
        .type   testd, @function
testd:
.LFB2:
        .cfi_startproc
        movq    %rdi, %xmm1
        pinsrq  $1, %rsi, %xmm1
        movdqa  %xmm1, %xmm0
        psrldq  $8, %xmm0
        paddd   %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $4, %xmm1
        paddd   %xmm1, %xmm0
        movd    %xmm0, %eax
        ret
        .cfi_endproc
.LFE2:

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (5 preceding siblings ...)
  2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
@ 2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
  2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-08-15 11:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:f7e672da8fc3d416a6d07eb01f3be4400ef94fac

commit r15-2930-gf7e672da8fc3d416a6d07eb01f3be4400ef94fac
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Aug 12 18:24:34 2024 +0800

    Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need
secondary reload.

    It results in 2 failures for x86_64-pc-linux-gnu{\
    -march=cascadelake};

    gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
    gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1

    For pr113560.c, now GCC generates mulx instead of mulq with
    -march=cascadelake, which should be optimal, so adjust testcase for
    that.
    For gcc.target/i386/extendditi2-1.c, RA happens to choose another
    register instead of rax and result in

            movq    %rdi, %rbp
            movq    %rdi, %rax
            sarq    $63, %rbp
            movq    %rbp, %rdx

    The patch adds a new define_peephole2 for that.

    gcc/ChangeLog:

            PR target/116274
            * config/i386/i386-expand.cc (ix86_expand_vector_move):
            Restrict special case TImode to 128-bit vector conversions via
            V2DI under ix86_pre_reload_split ().
            * config/i386/i386.cc (inline_secondary_memory_needed):
            Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
            need secondary reload.
            * config/i386/i386.md (*extendsidi2_rex64): Add a
            define_peephole2 after it.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr116274.c: New test.
            * gcc.target/i386/pr113560.c: Scan either mulq or mulx.

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (6 preceding siblings ...)
  2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
@ 2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
  2024-08-20  7:53 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15 11:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> 
> codegen is probably an RA/LRA artifact caused by bad instruction constraints
> and the refuse to reload to a gpr.  Not sure if a move high to gpr is a
> thing,
> pextrq would work for sure.  But an unpck looks like a better match anyway.

RA issue is fixed.

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (7 preceding siblings ...)
  2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
@ 2024-08-20  7:53 ` rguenth at gcc dot gnu.org
  2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-20  7:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks a lot - I'm re-testing the vectorizer costing patch now.

* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (8 preceding siblings ...)
  2024-08-20  7:53 ` rguenth at gcc dot gnu.org
@ 2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
  2024-09-18  9:30 ` [Bug target/116274] [14 " cvs-commit at gcc dot gnu.org
  2024-09-18  9:34 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-08-20 11:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #10 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b8ea13ebf1211714503fd72f25c04376483bfa53

commit r15-3036-gb8ea13ebf1211714503fd72f25c04376483bfa53
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Aug 8 11:36:43 2024 +0200

    tree-optimization/116274 - overzealous SLP vectorization

    The following tries to address that the vectorizer fails to have
    precise knowledge of argument and return calling conventions and
    views some accesses as loads and stores that are not.
    This is mainly important when doing basic-block vectorization as
    otherwise loop indexing would force such arguments to memory.

    On x86 the reduction in the number of apparent loads and stores
    often dominates cost analysis so the following tries to mitigate
    this aggressively by adjusting only the scalar load and store
    cost, reducing them to the cost of a simple scalar statement,
    but not touching the vector access cost which would be much
    harder to estimate.  Thereby we err on the side of not performing
    basic-block vectorization.

            PR tree-optimization/116274
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
            and stores as simple scalar stmts when they access a non-global,
            not address-taken variable that doesn't have BLKmode assigned.

            * gcc.target/i386/pr116274-2.c: New testcase.
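
To make the distinction in the ChangeLog entry concrete, here is a small sketch of three access patterns. Which of them the new costing treats as cheap is only a reading of the wording "non-global, not address-taken variable that doesn't have BLKmode assigned", not a statement about the actual implementation. All function names are hypothetical.

```c
#include <assert.h>

struct a { long x, y; };

/* By-value parameter whose address is never taken.  The RTL earlier in
   this thread shows such a parameter living in a TImode pseudo, so the
   member reads are not real memory loads. */
long sum_by_value(struct a a) { return a.x + a.y; }

/* Access through a pointer: the members are read from memory. */
long sum_by_pointer(const struct a *p) { return p->x + p->y; }

/* Access to a global: excluded by the "non-global" wording. */
struct a g = { 40, 2 };
long sum_global(void) { return g.x + g.y; }

/* All three compute the same value; only the costing would differ. */
void check_agree(void)
{
    struct a v = g;
    assert(sum_by_value(v) == sum_by_pointer(&v));
    assert(sum_by_value(v) == sum_global());
}
```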

* [Bug target/116274] [14 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (9 preceding siblings ...)
  2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
@ 2024-09-18  9:30 ` cvs-commit at gcc dot gnu.org
  2024-09-18  9:34 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-09-18  9:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

--- Comment #11 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-14 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d5d4f3bae5a9478dc2189e53da933175a6d7b197

commit r14-10681-gd5d4f3bae5a9478dc2189e53da933175a6d7b197
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Aug 8 11:36:43 2024 +0200

    tree-optimization/116274 - overzealous SLP vectorization

    The following tries to address that the vectorizer fails to have
    precise knowledge of argument and return calling conventions and
    views some accesses as loads and stores that are not.
    This is mainly important when doing basic-block vectorization as
    otherwise loop indexing would force such arguments to memory.

    On x86 the reduction in the number of apparent loads and stores
    often dominates cost analysis so the following tries to mitigate
    this aggressively by adjusting only the scalar load and store
    cost, reducing them to the cost of a simple scalar statement,
    but not touching the vector access cost which would be much
    harder to estimate.  Thereby we err on the side of not performing
    basic-block vectorization.

            PR tree-optimization/116274
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
            and stores as simple scalar stmts when they access a non-global,
            not address-taken variable that doesn't have BLKmode assigned.

            * gcc.target/i386/pr116274-2.c: New testcase.

    (cherry picked from commit b8ea13ebf1211714503fd72f25c04376483bfa53)

* [Bug target/116274] [14 Regression] x86: poor code generation with 16 byte function arguments and addition
  2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
                   ` (10 preceding siblings ...)
  2024-09-18  9:30 ` [Bug target/116274] [14 " cvs-commit at gcc dot gnu.org
@ 2024-09-18  9:34 ` rguenth at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-18  9:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |14.2.1
         Resolution|---                         |FIXED
   Target Milestone|13.4                        |14.3
             Status|ASSIGNED                    |RESOLVED

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
This backport fixed the testcase with the cost adjustments.  The RA issue is
still present for cases we'd still consider profitable.

Closing.
