public inbox for gcc-bugs@sourceware.org
* [Bug c/116274] New: x86: poor code generation with 16 byte function arguments
@ 2024-08-07 18:08 ripatel at wii dot dev
2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: ripatel at wii dot dev @ 2024-08-07 18:08 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Bug ID: 116274
Summary: x86: poor code generation with 16 byte function
arguments
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: ripatel at wii dot dev
Target Milestone: ---
The following program:
struct a { long x,y; };
long test(struct a a) { return a.x+a.y; }
compiled with
$ gcc -c -o test.o -march=x86-64-v2 -O3 test.c
results in 15 x86_64 instructions using xmm registers under the System V
calling convention, when two (lea, ret) should suffice.
$ objdump -d test.o
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <test>:
0: 0f 29 4c 24 e8 movaps %xmm1,-0x18(%rsp)
5: 48 8b 54 24 f0 mov -0x10(%rsp),%rdx
a: 66 48 0f 6e cf movq %rdi,%xmm1
f: 66 48 0f 6e de movq %rsi,%xmm3
14: 66 48 0f 3a 22 ca 01 pinsrq $0x1,%rdx,%xmm1
1b: 66 0f 6c cb punpcklqdq %xmm3,%xmm1
1f: 0f 29 4c 24 e8 movaps %xmm1,-0x18(%rsp)
24: 48 8b 44 24 f0 mov -0x10(%rsp),%rax
29: 66 0f 6f d1 movdqa %xmm1,%xmm2
2d: 66 48 0f 3a 22 d0 01 pinsrq $0x1,%rax,%xmm2
34: 66 0f 6f c2 movdqa %xmm2,%xmm0
38: 66 0f 73 d8 08 psrldq $0x8,%xmm0
3d: 66 0f d4 c1 paddq %xmm1,%xmm0
41: 66 48 0f 7e c0 movq %xmm0,%rax
46: c3 ret
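For context on the "two instructions" expectation: under the System V x86-64 ABI this 16-byte struct arrives in %rdi and %rsi, so the function amounts to a two-operand add. A minimal sketch follows; test_regs is a made-up name used only for comparison and is not part of the report.

```c
/* The struct from the report: two 8-byte members, 16 bytes in total. */
struct a { long x, y; };

/* The function from the report. */
long test(struct a a) { return a.x + a.y; }

/* Hypothetical equivalent for illustration: the same sum with the two
   members passed as separate integer arguments. */
long test_regs(long x, long y) { return x + y; }
```

Comparing the output of both functions at -O3 (for example with objdump -d) is one way to check whether a given compiler build treats them the same.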
Debug information:
$ gcc -v -save-temps -c -o test.o -march=x86-64-v2 -O3 test.c
Using built-in specs.
COLLECT_GCC=gcc
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,m2,lto --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared
--enable-threads=posix --enable-checking=release --enable-multilib
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --enable-libstdcxx-backtrace
--with-libstdcxx-zoneinfo=/usr/share/zoneinfo --with-linker-hash-style=gnu
--enable-plugin --enable-initfini-array
--with-isl=/builddir/build/BUILD/gcc-14.2.1-20240801/obj-x86_64-redhat-linux/isl-install
--enable-offload-targets=nvptx-none,amdgcn-amdhsa --enable-offload-defaulted
--without-cuda-driver --enable-gnu-indirect-function --enable-cet
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
--with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.2.1 20240801 (Red Hat 14.2.1-1) (GCC)
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
/usr/libexec/gcc/x86_64-redhat-linux/14/cc1 -E -quiet -v test.c
-march=x86-64-v2 -O3 -fpch-preprocess -o test.i
ignoring nonexistent directory
"/usr/lib/gcc/x86_64-redhat-linux/14/include-fixed"
ignoring nonexistent directory
"/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
/usr/lib/gcc/x86_64-redhat-linux/14/include
/usr/local/include
/usr/include
End of search list.
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
/usr/libexec/gcc/x86_64-redhat-linux/14/cc1 -fpreprocessed test.i -quiet
-dumpbase test.c -dumpbase-ext .c -march=x86-64-v2 -O3 -version -o test.s
GNU C17 (GCC) version 14.2.1 20240801 (Red Hat 14.2.1-1) (x86_64-redhat-linux)
compiled by GNU C version 14.2.1 20240801 (Red Hat 14.2.1-1), GMP
version 6.2.1, MPFR version 4.2.1, MPC version 1.3.1, isl version isl-0.24-GMP
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 7983ab47815232989bed61515b77d1c7
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3'
as -v --64 -o test.o test.s
GNU assembler version 2.41 (x86_64-redhat-linux) using BFD version version
2.41-37.fc40
COMPILER_PATH=/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/
LIBRARY_PATH=/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-save-temps' '-c' '-o' 'test.o' '-march=x86-64-v2'
'-O3' '-dumpdir' 'test.'
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
@ 2024-08-07 18:11 ` pinskia at gcc dot gnu.org
2024-08-08 9:07 ` [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition rguenth at gcc dot gnu.org
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-08-07 18:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Target Milestone|--- |13.4
Summary|x86: poor code generation |[14/15 Regression] x86:
|with 16 byte function |poor code generation with
|arguments |16 byte function arguments
Ever confirmed|0 |1
Last reconfirmed| |2024-08-07
Status|UNCONFIRMED |NEW
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed, comes from doing vectorization and then reduction add.
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
@ 2024-08-08 9:07 ` rguenth at gcc dot gnu.org
2024-08-08 9:52 ` rguenth at gcc dot gnu.org
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-08 9:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |sayle at gcc dot gnu.org
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
t.c:2:35: note: Cost model analysis:
_1 + _2 1 times scalar_stmt costs 4 in body
a.x 1 times scalar_load costs 12 in body
a.y 1 times scalar_load costs 12 in body
a.x 1 times unaligned_load (misalign -1) costs 12 in body
_1 + _2 1 times vector_stmt costs 4 in body
_1 + _2 1 times vec_perm costs 4 in body
_1 + _2 1 times vec_to_scalar costs 4 in body
_1 + _2 0 times scalar_stmt costs 0 in body
t.c:2:35: note: Cost model analysis for part in loop 0:
Vector cost: 24
Scalar cost: 28
t.c:2:35: note: Basic block will be vectorized using SLP
The vectorizer costing doesn't know that a.x and a.y are readily available
in registers, so the scalar cost of 24 for the two loads doesn't actually exist.
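The totals in the dump can be re-derived by hand. The sketch below only repeats that arithmetic with the per-statement costs printed above; the function names are made up, and the "vector cheaper than scalar" comparison is an assumption about how the decision is taken, not a copy of the vectorizer's code.

```c
/* Per-statement costs copied from the dump above. */
enum { SCALAR_STMT = 4, SCALAR_LOAD = 12, UNALIGNED_LOAD = 12,
       VECTOR_STMT = 4, VEC_PERM = 4, VEC_TO_SCALAR = 4 };

/* Scalar side: the add plus the loads of a.x and a.y. */
int scalar_cost(void) { return SCALAR_STMT + 2 * SCALAR_LOAD; }

/* Vector side: one vector load, then add, permute and extract. */
int vector_cost(void)
{
    return UNALIGNED_LOAD + VECTOR_STMT + VEC_PERM + VEC_TO_SCALAR;
}

/* Illustrative only: the scalar side once the two loads, which never
   happen for register arguments, are left out. */
int scalar_cost_without_loads(void) { return SCALAR_STMT; }

/* Assumed decision rule: vectorize when the vector side is cheaper. */
int looks_profitable(int vector, int scalar) { return vector < scalar; }
```

With the loads counted, 24 < 28 and the block is vectorized; without them the scalar side costs 4 and the comparison goes the other way.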
On the vector side there's the issue that we spill. We are expanding from
vect__1.5_5 = MEM <vector(2) long int> [(long int *)&a];
_6 = VIEW_CONVERT_EXPR<vector(2) unsigned long>(vect__1.5_5);
_7 = .REDUC_PLUS (_6); [tail call]
_8 = (long int) _7;
return _8;
;; _7 = .REDUC_PLUS (_6); [tail call]
(insn 10 9 11 (set (reg:V1TI 108)
(lshiftrt:V1TI (subreg:V1TI (reg/v:TI 102 [ a ]) 0)
(const_int 64 [0x40]))) -1
(nil))
(insn 11 10 12 (set (reg:V2DI 107)
(subreg:V2DI (reg:V1TI 108) 0)) -1
(nil))
(insn 12 11 13 (set (reg:V2DI 106)
(plus:V2DI (reg:V2DI 107)
(subreg:V2DI (reg/v:TI 102 [ a ]) 0))) -1
(nil))
(insn 13 12 0 (set (reg:DI 100 [ _7 ])
(vec_select:DI (reg:V2DI 106)
(parallel [
(const_int 0 [0])
]))) -1
(nil))
that's not unreasonable. Note we set up TI 102 like
(insn 2 8 3 2 (set (reg:DI 104)
(reg:DI 5 di [ a ])) "t.c":2:23 -1
(nil))
(insn 3 2 4 2 (set (reg:DI 105)
(reg:DI 4 si [ a+8 ])) "t.c":2:23 -1
(nil))
(insn 4 3 5 2 (set (reg:TI 103)
(zero_extend:TI (reg:DI 104))) "t.c":2:23 -1
(nil))
(insn 5 4 6 2 (set (reg:TI 103)
(ior:TI (and:TI (reg:TI 103)
(const_wide_int 0x0ffffffffffffffff))
(ashift:TI (zero_extend:TI (reg:DI 105))
(const_int 64 [0x40])))) "t.c":2:23 -1
(nil))
(insn 6 5 7 2 (set (reg/v:TI 102 [ a ])
(reg:TI 103)) "t.c":2:23 -1
(nil))
and the task is to "recover" from the back-and-forth. Unfortunately
combine fails:
Trying 5, 10 -> 12:
5: r103:TI=zero_extend(r111:DI)<<0x40|zero_extend(r110:DI)
REG_DEAD r111:DI
REG_DEAD r110:DI
10: r108:V1TI=r103:TI#0 0>>0x40
12: r106:V2DI=r108:V1TI#0+r103:TI#0
REG_DEAD r108:V1TI
REG_DEAD r103:TI
Failed to match this instruction:
(set (reg:V2DI 106)
(plus:V2DI (subreg:V2DI (lshiftrt:V1TI (subreg:V1TI (ior:TI (ashift:TI
(zero_extend:TI (reg:DI 111))
(const_int 64 [0x40]))
(zero_extend:TI (reg:DI 110))) 0)
(const_int 64 [0x40])) 0)
(subreg:V2DI (ior:TI (ashift:TI (zero_extend:TI (reg:DI 111))
(const_int 64 [0x40]))
(zero_extend:TI (reg:DI 110))) 0)))
It isn't clear why we end up spilling, why STV2 doesn't help in the end, or
why neither combine nor late-combine nor forwprop helps.
Of course the vectorizer costing is off here - load/store cost is dominating
it in general and I've mentioned decreasing the load/store costing compared
to the arithmetic stmt costing.
Still I would expect RTL optimizations to recover from this failure and
resurrect the scalar add of the incoming register arguments.
Roger is very good at analyzing this stuff, so CCing him.
The regression is because the target now exposes the two-lane V2DImode
reduc_plus pattern (if that were fed by a much larger sequence of
vectorizable arithmetic it should be a win).
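The expanded RTL in this comment can be read as the following scalar model. This is only a sketch: the v2di struct and the helper names are invented for illustration, and lanes are modelled as plain 64-bit integers rather than real vector registers.

```c
#include <stdint.h>

/* Two 64-bit lanes; lane[0] is the low half of the 128-bit value. */
typedef struct { uint64_t lane[2]; } v2di;

/* insns 4-6: build the TImode value from the two incoming registers. */
static v2di concat(uint64_t lo, uint64_t hi)
{
    v2di v = { { lo, hi } };
    return v;
}

/* insn 10: logical right shift of the whole 128-bit value by 64 bits. */
static v2di shift_right_64(v2di v)
{
    v2di r = { { v.lane[1], 0 } };
    return r;
}

/* insn 12: lane-wise 64-bit add (wraps on overflow). */
static v2di add_lanes(v2di a, v2di b)
{
    v2di r = { { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] } };
    return r;
}

/* insn 13: select lane 0 of the sum. */
uint64_t reduc_plus(uint64_t rdi, uint64_t rsi)
{
    v2di a = concat(rdi, rsi);
    v2di sum = add_lanes(shift_right_64(a), a);
    return sum.lane[0];
}
```

Lane 0 of the result is hi + lo, which is the scalar add of the two incoming registers that the RTL passes would ideally recover.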
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
2024-08-07 18:11 ` [Bug target/116274] [14/15 Regression] " pinskia at gcc dot gnu.org
2024-08-08 9:07 ` [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition rguenth at gcc dot gnu.org
@ 2024-08-08 9:52 ` rguenth at gcc dot gnu.org
2024-08-12 8:19 ` liuhongt at gcc dot gnu.org
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-08 9:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
struct a { int x,y,z,w; };
int test(struct a a) { return a.x+a.y+a.z+a.w; }
behaves similarly.
I do have a patch for the vectorizer costing that avoids vectorizing in
these cases. We will still vectorize
struct a { short a0,a1,a2,a3,a4,a5,a6,a7; };
short test(struct a a) { return a.a0+a.a1+a.a2+a.a3+a.a4+a.a5+a.a6+a.a7; }
generating
test:
.LFB0:
.cfi_startproc
movaps %xmm1, -24(%rsp)
movq -16(%rsp), %rdx
movq %rdi, %xmm1
movq %rsi, %xmm3
pinsrq $1, %rdx, %xmm1
punpcklqdq %xmm3, %xmm1
movaps %xmm1, -24(%rsp)
movdqa %xmm1, %xmm2
pinsrq $1, -16(%rsp), %xmm2
movdqa %xmm2, %xmm0
psrldq $8, %xmm0
paddw %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrldq $4, %xmm1
paddw %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrldq $2, %xmm1
paddw %xmm1, %xmm0
pextrw $0, %xmm0, %eax
ret
as opposed to
test:
.LFB0:
.cfi_startproc
movl %edi, %eax
movq %rdi, %rdx
sarl $16, %eax
salq $16, %rdx
addl %edi, %eax
sarq $48, %rdx
addl %edx, %eax
sarq $48, %rdi
movl %esi, %edx
addl %edi, %eax
sarl $16, %edx
addl %esi, %eax
addl %edx, %eax
movq %rsi, %rdx
sarq $48, %rsi
salq $16, %rdx
sarq $48, %rdx
addl %edx, %eax
addl %esi, %eax
ret
it still has the odd (dead)
movaps %xmm1, -24(%rsp)
movq -16(%rsp), %rdx
The
movaps %xmm1, -24(%rsp)
movdqa %xmm1, %xmm2
pinsrq $1, -16(%rsp), %xmm2
codegen is probably an RA/LRA artifact caused by bad instruction constraints
and the refusal to reload to a GPR. Not sure if a move-high-to-GPR is a thing;
pextrq would work for sure. But an unpck looks like a better match anyway.
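The vectorized sequence above is a log-step reduction: shift by 8, 4 and 2 bytes, adding after each shift, then extract lane 0. A rough C model, with made-up helper names and plain arrays standing in for xmm registers:

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };

/* Model of psrldq: shift right by n 16-bit lanes, filling with zero. */
static void shift_lanes_right(uint16_t dst[LANES],
                              const uint16_t src[LANES], int n)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = (i + n < LANES) ? src[i + n] : 0;
}

/* Model of paddw: lane-wise 16-bit add with wraparound. */
static void add_lanes(uint16_t acc[LANES], const uint16_t other[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] = (uint16_t)(acc[i] + other[i]);
}

/* Sum all eight lanes in three shift-and-add steps. */
uint16_t reduce_add_u16x8(const uint16_t in[LANES])
{
    uint16_t acc[LANES], tmp[LANES];
    memcpy(acc, in, sizeof acc);
    /* 4, 2, 1 lanes correspond to shifts of 8, 4, 2 bytes. */
    for (int n = LANES / 2; n >= 1; n /= 2) {
        shift_lanes_right(tmp, acc, n);
        add_lanes(acc, tmp);
    }
    return acc[0]; /* lane 0, as read by pextrw $0 */
}
```

Only lane 0 holds the full sum at the end; the other lanes carry partial sums that are discarded.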
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (2 preceding siblings ...)
2024-08-08 9:52 ` rguenth at gcc dot gnu.org
@ 2024-08-12 8:19 ` liuhongt at gcc dot gnu.org
2024-08-12 9:41 ` liuhongt at gcc dot gnu.org
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12 8:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |liuhongt at gcc dot gnu.org
--- Comment #4 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
With the patch below, compiled with -march=x86-64-v3 -O3, the redundant spills are gone.
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f044826269c..e8bcf314752 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20292,6 +20292,10 @@ inline_secondary_memory_needed (machine_mode mode, reg_class_t class1,
if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
return true;
+ /* *movti_internal supports movement between SSE_REGS and GENERAL_REGS. */
+ if (mode == TImode)
+ return false;
+
int msize = GET_MODE_SIZE (mode);
/* Between SSE and general, we have moves no larger than word size. */
struct aq { long x,y; };
long testq(struct aq a) { return a.x+a.y; }
struct aw { short a0,a1,a2,a3,a4,a5,a6,a7; };
short testw(struct aw a) { return a.a0+a.a1+a.a2+a.a3+a.a4+a.a5+a.a6+a.a7; }
struct ad { int x,y,z,w; };
int testd(struct ad a) { return a.x+a.y+a.z+a.w; }
testq:
.LFB0:
.cfi_startproc
vmovq %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddq %xmm1, %xmm0, %xmm0
vmovq %xmm0, %rax
ret
.cfi_endproc
.LFE0:
.size testq, .-testq
.p2align 4
.globl testw
.type testw, @function
testw:
.LFB1:
.cfi_startproc
vmovq %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddw %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddw %xmm1, %xmm0, %xmm0
vpsrldq $2, %xmm0, %xmm1
vpaddw %xmm1, %xmm0, %xmm0
vpextrw $0, %xmm0, %eax
ret
.cfi_endproc
.LFE1:
.size testw, .-testw
.p2align 4
.globl testd
.type testd, @function
testd:
.LFB2:
.cfi_startproc
vmovq %rdi, %xmm1
vpinsrq $1, %rsi, %xmm1, %xmm1
vpsrldq $8, %xmm1, %xmm0
vpaddd %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax
ret
.cfi_endproc
But with -march=x86-64-v2 or -march=x86-64 -O3, the spills are still there,
hmm.
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (3 preceding siblings ...)
2024-08-12 8:19 ` liuhongt at gcc dot gnu.org
@ 2024-08-12 9:41 ` liuhongt at gcc dot gnu.org
2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12 9:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #5 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
For the non-AVX case, it looks like it hits this code:
  /* Special case TImode to 128-bit vector conversions via V2DI. */
  if (VECTOR_MODE_P (mode)
      && GET_MODE_SIZE (mode) == 16
      && SUBREG_P (op1)
      && GET_MODE (SUBREG_REG (op1)) == TImode
      && TARGET_64BIT && TARGET_SSE
      && can_create_pseudo_p ())
    {
      rtx tmp = gen_reg_rtx (V2DImode);
      rtx lo = gen_reg_rtx (DImode);
      rtx hi = gen_reg_rtx (DImode);
      emit_move_insn (lo, gen_lowpart (DImode, SUBREG_REG (op1)));
      emit_move_insn (hi, gen_highpart (DImode, SUBREG_REG (op1)));
      emit_insn (gen_vec_concatv2di (tmp, lo, hi));
      emit_move_insn (op0, gen_lowpart (mode, tmp));
      return;
    }
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (4 preceding siblings ...)
2024-08-12 9:41 ` liuhongt at gcc dot gnu.org
@ 2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-12 10:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #5)
> For the non-AVX case, it looks like it hits this code:
>
> /* Special case TImode to 128-bit vector conversions via V2DI. */
>
Preventing that during reload, we get
.file "test.c"
.text
.p2align 4
.globl testq
.type testq, @function
testq:
.LFB0:
.cfi_startproc
movq %rdi, %xmm1
pinsrq $1, %rsi, %xmm1
movdqa %xmm1, %xmm0
psrldq $8, %xmm0
paddq %xmm1, %xmm0
movq %xmm0, %rax
ret
.cfi_endproc
.LFE0:
.size testq, .-testq
.p2align 4
.globl testw
.type testw, @function
testw:
.LFB1:
.cfi_startproc
movq %rdi, %xmm1
pinsrq $1, %rsi, %xmm1
movdqa %xmm1, %xmm0
psrldq $8, %xmm0
paddw %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrldq $4, %xmm1
paddw %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrldq $2, %xmm1
paddw %xmm1, %xmm0
pextrw $0, %xmm0, %eax
ret
.cfi_endproc
.LFE1:
.size testw, .-testw
.p2align 4
.globl testd
.type testd, @function
testd:
.LFB2:
.cfi_startproc
movq %rdi, %xmm1
pinsrq $1, %rsi, %xmm1
movdqa %xmm1, %xmm0
psrldq $8, %xmm0
paddd %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrldq $4, %xmm1
paddd %xmm1, %xmm0
movd %xmm0, %eax
ret
.cfi_endproc
.LFE2:
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (5 preceding siblings ...)
2024-08-12 10:18 ` liuhongt at gcc dot gnu.org
@ 2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-08-15 11:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #7 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:f7e672da8fc3d416a6d07eb01f3be4400ef94fac
commit r15-2930-gf7e672da8fc3d416a6d07eb01f3be4400ef94fac
Author: liuhongt <hongtao.liu@intel.com>
Date: Mon Aug 12 18:24:34 2024 +0800
Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need
secondary reload.
It results in 2 failures for x86_64-pc-linux-gnu{\
-march=cascadelake};
gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
For pr113560.c, now GCC generates mulx instead of mulq with
-march=cascadelake, which should be optimal, so adjust testcase for
that.
For gcc.target/i386/extendditi2-1.c, RA happens to choose another
register instead of rax and result in
movq %rdi, %rbp
movq %rdi, %rax
sarq $63, %rbp
movq %rbp, %rdx
The patch adds a new define_peephole2 for that.
gcc/ChangeLog:
PR target/116274
* config/i386/i386-expand.cc (ix86_expand_vector_move):
Restrict special case TImode to 128-bit vector conversions via
V2DI under ix86_pre_reload_split ().
* config/i386/i386.cc (inline_secondary_memory_needed):
Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
need secondary reload.
* config/i386/i386.md (*extendsidi2_rex64): Add a
define_peephole2 after it.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr116274.c: New test.
* gcc.target/i386/pr113560.c: Scan either mulq or mulx.
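The four-instruction sequence quoted in the commit message computes a 64-to-128-bit sign extension by hand (copy, then arithmetic shift right by 63 for the high half), which is what a single cqto does when the value is already in %rax. A small C model of that operation, with an invented ti_pair type standing in for the %rdx:%rax pair:

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } ti_pair;

/* Sign-extend a 64-bit integer to 128 bits: the high half is all ones
   for negative inputs and zero otherwise, the value sarq $63 produces. */
ti_pair sign_extend_di_to_ti(int64_t x)
{
    ti_pair r;
    r.lo = (uint64_t)x;
    r.hi = (x < 0) ? UINT64_MAX : 0;
    return r;
}
```

The conditional is used instead of a right shift of a negative value because that shift is implementation-defined in C.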
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (6 preceding siblings ...)
2024-08-15 11:15 ` cvs-commit at gcc dot gnu.org
@ 2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
2024-08-20 7:53 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-08-15 11:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
>
> codegen is probably an RA/LRA artifact caused by bad instruction constraints
> and the refusal to reload to a GPR. Not sure if a move-high-to-GPR is a
> thing;
> pextrq would work for sure. But an unpck looks like a better match anyway.
RA issue is fixed.
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (7 preceding siblings ...)
2024-08-15 11:16 ` liuhongt at gcc dot gnu.org
@ 2024-08-20 7:53 ` rguenth at gcc dot gnu.org
2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-20 7:53 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks a lot - I'm re-testing the vectorizer costing patch now.
* [Bug target/116274] [14/15 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (8 preceding siblings ...)
2024-08-20 7:53 ` rguenth at gcc dot gnu.org
@ 2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
2024-09-18 9:30 ` [Bug target/116274] [14 " cvs-commit at gcc dot gnu.org
2024-09-18 9:34 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-08-20 11:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #10 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:b8ea13ebf1211714503fd72f25c04376483bfa53
commit r15-3036-gb8ea13ebf1211714503fd72f25c04376483bfa53
Author: Richard Biener <rguenther@suse.de>
Date: Thu Aug 8 11:36:43 2024 +0200
tree-optimization/116274 - overzealous SLP vectorization
The following tries to address that the vectorizer fails to have
precise knowledge of argument and return calling conventions and
views some accesses as loads and stores that are not.
This is mainly important when doing basic-block vectorization as
otherwise loop indexing would force such arguments to memory.
On x86 the reduction in the number of apparent loads and stores
often dominates cost analysis so the following tries to mitigate
this aggressively by adjusting only the scalar load and store
cost, reducing them to the cost of a simple scalar statement,
but not touching the vector access cost which would be much
harder to estimate. Thereby we error on the side of not performing
basic-block vectorization.
PR tree-optimization/116274
* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
and stores as simple scalar stmts when they access a non-global,
not address-taken variable that doesn't have BLKmode assigned.
* gcc.target/i386/pr116274-2.c: New testcase.
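The effect of this costing change on the original testcase can be sketched with the numbers from comment #2. This is an illustration only: the helper names and the register_like flag are made up, the costs are the ones printed in that dump, and the real logic lives in vect_bb_slp_scalar_cost.

```c
#include <stdbool.h>

/* Costs as printed in comment #2 of this thread. */
enum { SCALAR_STMT = 4, SCALAR_LOAD = 12, VECTOR_COST = 24 };

/* Hypothetical helper: an access to a local, non-address-taken,
   non-BLKmode variable is costed like a plain scalar statement,
   anything else as a real load. */
int scalar_load_cost(bool register_like)
{
    return register_like ? SCALAR_STMT : SCALAR_LOAD;
}

/* Scalar side for a.x + a.y: one add plus two accesses. */
int scalar_cost(bool register_like)
{
    return SCALAR_STMT + 2 * scalar_load_cost(register_like);
}

/* Assumed decision rule: vectorize when the vector side is cheaper.
   The vector cost is deliberately left untouched. */
bool would_vectorize(bool register_like)
{
    return VECTOR_COST < scalar_cost(register_like);
}
```

With the adjustment the scalar side drops from 28 to 12, below the unchanged vector cost of 24, so the basic block is left scalar.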
* [Bug target/116274] [14 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (9 preceding siblings ...)
2024-08-20 11:02 ` cvs-commit at gcc dot gnu.org
@ 2024-09-18 9:30 ` cvs-commit at gcc dot gnu.org
2024-09-18 9:34 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-09-18 9:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
--- Comment #11 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-14 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:d5d4f3bae5a9478dc2189e53da933175a6d7b197
commit r14-10681-gd5d4f3bae5a9478dc2189e53da933175a6d7b197
Author: Richard Biener <rguenther@suse.de>
Date: Thu Aug 8 11:36:43 2024 +0200
tree-optimization/116274 - overzealous SLP vectorization
The following tries to address that the vectorizer fails to have
precise knowledge of argument and return calling conventions and
views some accesses as loads and stores that are not.
This is mainly important when doing basic-block vectorization as
otherwise loop indexing would force such arguments to memory.
On x86 the reduction in the number of apparent loads and stores
often dominates cost analysis so the following tries to mitigate
this aggressively by adjusting only the scalar load and store
cost, reducing them to the cost of a simple scalar statement,
but not touching the vector access cost which would be much
harder to estimate. Thereby we error on the side of not performing
basic-block vectorization.
PR tree-optimization/116274
* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
and stores as simple scalar stmts when they access a non-global,
not address-taken variable that doesn't have BLKmode assigned.
* gcc.target/i386/pr116274-2.c: New testcase.
(cherry picked from commit b8ea13ebf1211714503fd72f25c04376483bfa53)
* [Bug target/116274] [14 Regression] x86: poor code generation with 16 byte function arguments and addition
2024-08-07 18:08 [Bug c/116274] New: x86: poor code generation with 16 byte function arguments ripatel at wii dot dev
` (10 preceding siblings ...)
2024-09-18 9:30 ` [Bug target/116274] [14 " cvs-commit at gcc dot gnu.org
@ 2024-09-18 9:34 ` rguenth at gcc dot gnu.org
11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-18 9:34 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116274
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Known to work| |14.2.1
Resolution|--- |FIXED
Target Milestone|13.4 |14.3
Status|ASSIGNED |RESOLVED
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
This backport fixed the testcase with the cost adjustments. The RA issue is
still present for cases we'd still consider profitable.
Closing.