public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2
@ 2015-01-22 15:15 jtaylor.debian at googlemail dot com
2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: jtaylor.debian at googlemail dot com @ 2015-01-22 15:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
Bug ID: 64731
Summary: poor code when using vector_size((32)) for sse2
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jtaylor.debian at googlemail dot com
It would be nice if, for some simple cases, a vector_size too large for the
selected instruction set still produced efficient code.
E.g. using vector_size of 32 for SSE2 results in essentially once-unrolled
vector_size-16 code, and the same source simply uses AVX if one compiles with
the appropriate option.
But with current gcc 5.0 with this code:
typedef double double4 __attribute__((vector_size(32)));

void fun(double * a, double * b)
{
  for (int i = 0; i < 1024; i += 4) {
    *(double4*)&a[i] += *(double4*)&b[i];
  }
}
with AVX this turns into the expected code, but with only SSE2 enabled one gets
this:
gcc -O3 test2.c -c -std=c99
0000000000000000 <fun>:
0: 4c 8d 54 24 08 lea 0x8(%rsp),%r10
5: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp
9: 31 c0 xor %eax,%eax
b: 41 ff 72 f8 pushq -0x8(%r10)
f: 55 push %rbp
10: 48 89 e5 mov %rsp,%rbp
13: 41 52 push %r10
15: 48 83 ec 10 sub $0x10,%rsp
19: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
20: 48 8b 14 07 mov (%rdi,%rax,1),%rdx
24: 48 89 55 90 mov %rdx,-0x70(%rbp)
28: 48 8b 54 07 08 mov 0x8(%rdi,%rax,1),%rdx
2d: 48 89 55 98 mov %rdx,-0x68(%rbp)
31: 48 8b 54 07 10 mov 0x10(%rdi,%rax,1),%rdx
36: 48 89 55 a0 mov %rdx,-0x60(%rbp)
3a: 48 8b 54 07 18 mov 0x18(%rdi,%rax,1),%rdx
3f: 48 89 55 a8 mov %rdx,-0x58(%rbp)
43: 48 8b 14 06 mov (%rsi,%rax,1),%rdx
47: 48 89 55 b0 mov %rdx,-0x50(%rbp)
4b: 48 8b 54 06 08 mov 0x8(%rsi,%rax,1),%rdx
50: 48 89 55 b8 mov %rdx,-0x48(%rbp)
54: 48 8b 54 06 10 mov 0x10(%rsi,%rax,1),%rdx
59: 66 0f 28 45 b0 movapd -0x50(%rbp),%xmm0
5e: 66 0f 58 45 90 addpd -0x70(%rbp),%xmm0
63: 48 89 55 c0 mov %rdx,-0x40(%rbp)
67: 48 8b 54 06 18 mov 0x18(%rsi,%rax,1),%rdx
6c: 48 89 55 c8 mov %rdx,-0x38(%rbp)
70: 0f 29 85 70 ff ff ff movaps %xmm0,-0x90(%rbp)
77: 66 48 0f 7e c2 movq %xmm0,%rdx
7c: 66 0f 28 45 c0 movapd -0x40(%rbp),%xmm0
81: 48 89 14 07 mov %rdx,(%rdi,%rax,1)
85: 48 8b 95 78 ff ff ff mov -0x88(%rbp),%rdx
8c: 66 0f 58 45 a0 addpd -0x60(%rbp),%xmm0
91: 0f 29 45 80 movaps %xmm0,-0x80(%rbp)
95: 48 89 54 07 08 mov %rdx,0x8(%rdi,%rax,1)
9a: 48 8b 55 80 mov -0x80(%rbp),%rdx
9e: 48 89 54 07 10 mov %rdx,0x10(%rdi,%rax,1)
a3: 48 8b 55 88 mov -0x78(%rbp),%rdx
a7: 48 89 54 07 18 mov %rdx,0x18(%rdi,%rax,1)
ac: 48 83 c0 20 add $0x20,%rax
b0: 48 3d 00 20 00 00 cmp $0x2000,%rax
b6: 0f 85 64 ff ff ff jne 20 <fun+0x20>
bc: 48 83 c4 10 add $0x10,%rsp
c0: 41 5a pop %r10
c2: 5d pop %rbp
c3: 49 8d 62 f8 lea -0x8(%r10),%rsp
c7: c3 retq
c8: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
cf: 00
while I would have hoped for something along the lines of this:
10: 66 0f 28 44 c6 10 movapd 0x10(%rsi,%rax,8),%xmm0
16: 66 0f 28 0c c6 movapd (%rsi,%rax,8),%xmm1
1b: 66 0f 58 0c c7 addpd (%rdi,%rax,8),%xmm1
20: 66 0f 58 44 c7 10 addpd 0x10(%rdi,%rax,8),%xmm0
26: 66 0f 29 44 c7 10 movapd %xmm0,0x10(%rdi,%rax,8)
2c: 66 0f 29 0c c7 movapd %xmm1,(%rdi,%rax,8)
31: 48 83 c0 04 add $0x4,%rax
35: 3d 00 04 00 00 cmp $0x400,%eax
3a: 7c d4 jl 10 <fun+0x10>
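For comparison, hand-splitting the loop to the 16-byte vectors SSE2 supports
natively compiles to roughly the asm hoped for above. This is an illustrative
sketch, not part of the original report; the names double2 and fun_split are
made up here:

```c
/* Hand-split variant of fun(): operate on 16-byte vectors directly so
   SSE2 can load, add, and store each half without spilling.  */
typedef double double2 __attribute__((vector_size(16)));

void fun_split(double *a, double *b)
{
    for (int i = 0; i < 1024; i += 2) {
        *(double2 *)&a[i] += *(double2 *)&b[i];
    }
}
```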
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
@ 2015-01-22 16:17 ` rguenth at gcc dot gnu.org
2015-01-22 16:23 ` jakub at gcc dot gnu.org
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-01-22 16:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2015-01-22
CC| |rguenth at gcc dot gnu.org
Summary|poor code when using |vector lowering should
|vector_size((32)) for sse2 |split loads and stores
Ever confirmed|0 |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, the issue is "simple" - veclower doesn't split the loads/stores themselves,
only the registers:
<bb 3>:
  # ivtmp.11_24 = PHI <ivtmp.11_23(3), 0(2)>
  _8 = MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B];
  _11 = MEM[base: b_9(D), index: ivtmp.11_24, offset: 0B];
  _17 = BIT_FIELD_REF <_8, 128, 0>;
  _4 = BIT_FIELD_REF <_11, 128, 0>;
  _5 = _4 + _17;
  _29 = BIT_FIELD_REF <_8, 128, 128>;
  _28 = BIT_FIELD_REF <_11, 128, 128>;
  _14 = _28 + _29;
  _12 = {_5, _14};
  MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B] = _12;
  ivtmp.11_23 = ivtmp.11_24 + 32;
  if (ivtmp.11_23 != 8192)
    goto <bb 3>;
  else
    goto <bb 4>;
in this case it would also have a moderately hard time splitting the
loads/stores as it is faced with TARGET_MEM_REFs already.
Nothing combines this back into a sane form. I've recently added code that
handles exactly the same situation but only for complex arithmetic
(in tree-ssa-forwprop.c for PR64568).
I wonder why with only -msse2 IVOPTs produces TARGET_MEM_REFs for the loads.
For sure x86_64 cannot load V4DF in one instruction...
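The lowering described above can be sketched in plain C (an illustrative
rewrite, not GCC source; the names double2 and fun_lowered are made up here):
the 32-byte loads and stores stay whole and only the arithmetic is split into
16-byte halves, which is what forces the spills through the stack:

```c
typedef double double4 __attribute__((vector_size(32)));
typedef double double2 __attribute__((vector_size(16)));

/* Roughly what the lowered GIMPLE above does: unsplit 32-byte loads and
   stores, with only the addition done on 16-byte halves.  The element
   shuffling models the BIT_FIELD_REFs and the final CTOR.  */
void fun_lowered(double *a, double *b)
{
    for (int i = 0; i < 1024; i += 4) {
        double4 va = *(double4 *)&a[i];              /* unsplit 32-byte load  */
        double4 vb = *(double4 *)&b[i];
        double2 alo = { va[0], va[1] };              /* BIT_FIELD_REF <_8, 128, 0>   */
        double2 ahi = { va[2], va[3] };              /* BIT_FIELD_REF <_8, 128, 128> */
        double2 blo = { vb[0], vb[1] };
        double2 bhi = { vb[2], vb[3] };
        double2 lo = alo + blo;
        double2 hi = ahi + bhi;
        double4 r = { lo[0], lo[1], hi[0], hi[1] };  /* the CTOR {_5, _14}    */
        *(double4 *)&a[i] = r;                       /* unsplit 32-byte store */
    }
}
```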
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
@ 2015-01-22 16:23 ` jakub at gcc dot gnu.org
2015-01-22 16:29 ` rguenther at suse dot de
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-01-22 16:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Doesn't look like a regression; I see roughly the same code quality all the
way from 4.1 (which I tried first) to current trunk.
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
2015-01-22 16:23 ` jakub at gcc dot gnu.org
@ 2015-01-22 16:29 ` rguenther at suse dot de
2015-01-22 18:21 ` glisse at gcc dot gnu.org
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: rguenther at suse dot de @ 2015-01-22 16:29 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 22 Jan 2015, jakub at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
>
> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Doesn't look like a regression; I see roughly the same code quality all the
> way from 4.1 (which I tried first) to current trunk.
Yes, it's not a regression and a fix has to wait for GCC 6.
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
` (2 preceding siblings ...)
2015-01-22 16:29 ` rguenther at suse dot de
@ 2015-01-22 18:21 ` glisse at gcc dot gnu.org
2023-05-12 4:38 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: glisse at gcc dot gnu.org @ 2015-01-22 18:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
Seems strongly related to PR55266 (which also references PR52436).
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
` (3 preceding siblings ...)
2015-01-22 18:21 ` glisse at gcc dot gnu.org
@ 2023-05-12 4:38 ` pinskia at gcc dot gnu.org
2023-05-12 6:42 ` rguenth at gcc dot gnu.org
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-05-12 4:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On the trunk we get at -O2:
.L2:
leaq (%rdi,%rax), %rcx
leaq (%rsi,%rax), %rdx
movapd (%rdx), %xmm0
addpd (%rcx), %xmm0
movaps %xmm0, -64(%rsp)
movapd 16(%rcx), %xmm0
addpd 16(%rdx), %xmm0
movaps %xmm0, -48(%rsp)
movdqa -64(%rsp), %xmm0
movaps %xmm0, (%rdi,%rax)
movdqa -48(%rsp), %xmm0
movaps %xmm0, 16(%rdi,%rax)
addq $32, %rax
cmpq $8192, %rax
jne .L2
Which is definitely better than before.
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
` (4 preceding siblings ...)
2023-05-12 4:38 ` pinskia at gcc dot gnu.org
@ 2023-05-12 6:42 ` rguenth at gcc dot gnu.org
2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
2023-05-12 13:05 ` rguenth at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-12 6:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Status|NEW |ASSIGNED
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, forwprop splits the vector loads:
_5 = &MEM[(double4 *)a_11(D) + ivtmp.14_22 * 1];
_1 = BIT_FIELD_REF <MEM[(double4 *)_5], 128, 128>;
_25 = BIT_FIELD_REF <MEM[(double4 *)_5], 128, 0>;
_14 = &MEM[(double4 *)b_12(D) + ivtmp.14_22 * 1];
_2 = BIT_FIELD_REF <MEM[(double4 *)_14], 128, 128>;
_17 = BIT_FIELD_REF <MEM[(double4 *)_14], 128, 0>;
_24 = _17 + _25;
_3 = _1 + _2;
but not the store from the CTOR:
_7 = {_24, _3};
MEM[(double4 *)a_11(D) + ivtmp.14_22 * 1] = _7;
forwprop would also split that, but we have
  else if (code == CONSTRUCTOR
           && VECTOR_TYPE_P (TREE_TYPE (rhs))
           && TYPE_MODE (TREE_TYPE (rhs)) == BLKmode
           && CONSTRUCTOR_NELTS (rhs) > 0
           && (!VECTOR_TYPE_P (TREE_TYPE (CONSTRUCTOR_ELT (rhs, 0)->value))
               || (TYPE_MODE (TREE_TYPE (CONSTRUCTOR_ELT (rhs, 0)->value))
                   != BLKmode)))
    {
      /* Rewrite stores of a single-use vector constructors
         to component-wise stores if the mode isn't supported.  */
      use_operand_p use_p;
      gimple *use_stmt;
      if (single_imm_use (lhs, &use_p, &use_stmt)
          && gimple_store_p (use_stmt)
          && !gimple_has_volatile_ops (use_stmt)
          && !stmt_can_throw_internal (fun, use_stmt)
          && is_gimple_assign (use_stmt)
          && (TREE_CODE (gimple_assign_lhs (use_stmt))
              != TARGET_MEM_REF))
and in this case there's a TARGET_MEM_REF on the LHS. With -fno-ivopts we get:
.L2:
movslq %ecx, %rax
addl $4, %ecx
salq $3, %rax
leaq (%rdi,%rax), %rdx
addq %rsi, %rax
movapd 16(%rax), %xmm0
movapd (%rdx), %xmm1
addpd 16(%rdx), %xmm0
addpd (%rax), %xmm1
movaps %xmm0, 16(%rdx)
movaps %xmm1, (%rdx)
subl $1, %r8d
jne .L2
We could use the same trick as optimize_vector_load and, instead of treating
the TARGET_MEM_REF as a memory reference, use it only for address generation.
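The decomposition in question corresponds in C to something like the following
hypothetical illustration (store_split is not a GCC function; the names are
made up here): the computed address is used only for address generation, and
the CTOR store becomes two element-wise 16-byte stores instead of a spill plus
a BLKmode copy:

```c
typedef double double2 __attribute__((vector_size(16)));

/* Hypothetical illustration of the decomposed store: rather than build
   the unsupported 32-byte vector value in memory and copy it, store the
   two 16-byte halves directly through the computed address p.  */
void store_split(double *p, double2 lo, double2 hi)
{
    *(double2 *)p       = lo;   /* elements 0 and 1 */
    *(double2 *)(p + 2) = hi;   /* elements 2 and 3 */
}
```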
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
` (5 preceding siblings ...)
2023-05-12 6:42 ` rguenth at gcc dot gnu.org
@ 2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
2023-05-12 13:05 ` rguenth at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-05-12 13:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:cc0e22b3f25d4b2a326322bce711179c02377e6c
commit r14-788-gcc0e22b3f25d4b2a326322bce711179c02377e6c
Author: Richard Biener <rguenther@suse.de>
Date: Fri May 12 13:43:27 2023 +0200
tree-optimization/64731 - extend store-from CTOR lowering to TARGET_MEM_REF
The following also covers TARGET_MEM_REF when decomposing stores from
CTORs to supported elementwise operations. This avoids spilling
and cleans up after vector lowering which doesn't touch loads or
stores. It also mimics what we already do for loads.
PR tree-optimization/64731
* tree-ssa-forwprop.cc (pass_forwprop::execute): Also
handle TARGET_MEM_REF destinations of stores from vector
CTORs.
* gcc.target/i386/pr64731.c: New testcase.
* [Bug tree-optimization/64731] vector lowering should split loads and stores
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
` (6 preceding siblings ...)
2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
@ 2023-05-12 13:05 ` rguenth at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-12 13:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
Target Milestone|--- |14.0
Known to fail| |13.1.0
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 14.
end of thread, other threads:[~2023-05-12 13:05 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
2015-01-22 16:23 ` jakub at gcc dot gnu.org
2015-01-22 16:29 ` rguenther at suse dot de
2015-01-22 18:21 ` glisse at gcc dot gnu.org
2023-05-12 4:38 ` pinskia at gcc dot gnu.org
2023-05-12 6:42 ` rguenth at gcc dot gnu.org
2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
2023-05-12 13:05 ` rguenth at gcc dot gnu.org