public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/99971] New: GCC generates partially vectorized and scalar code at once
@ 2021-04-08 14:41 andysem at mail dot ru
From: andysem at mail dot ru @ 2021-04-08 14:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
Bug ID: 99971
Summary: GCC generates partially vectorized and scalar code at once
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru
Target Milestone: ---
Consider the following code sample:
struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a;
        b += that.b;
        c += that.c;
        d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a;
        b -= that.b;
        c -= that.c;
        d -= that.d;
        return *this;
    }
};

void test(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}
The code, when compiled with options "-O3 -march=nehalem", generates:
test(A&, A const&, A const&):
pushq %rbp
movdqu (%rdi), %xmm1
pushq %rbx
movl 4(%rsi), %r8d
movdqu (%rsi), %xmm0
movl (%rsi), %r9d
paddd %xmm1, %xmm0
movl 8(%rsi), %ecx
movl 12(%rsi), %eax
movl %r8d, %esi
movl (%rdi), %ebp
movl 4(%rdi), %ebx
movl 8(%rdi), %r11d
movl 12(%rdi), %r10d
movups %xmm0, (%rdi)
subl (%rdx), %r9d
subl 4(%rdx), %esi
subl 8(%rdx), %ecx
subl 12(%rdx), %eax
addl %ebp, %r9d
addl %ebx, %esi
movl %r9d, (%rdi)
popq %rbx
addl %r11d, %ecx
popq %rbp
movl %esi, 4(%rdi)
addl %r10d, %eax
movl %ecx, 8(%rdi)
movl %eax, 12(%rdi)
ret
https://gcc.godbolt.org/z/Mzchj8bxG
Here you can see that the compiler has partially vectorized the test function -
it converted "x += y1" to paddd, as expected, but failed to vectorize "x -=
y2". But at the same time the compiler also generated scalar code, including
for the already vectorized "x += y1" line, basically duplicating it.
Note that when either "x += y1" or "x -= y2" is commented out, the compiler is
able to vectorize the line that is left. It is also able to vectorize both lines
when the += and -= operators are applied to different objects instead of x.
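For illustration, the "different objects" variant could look like the following minimal sketch. The names test_separate and demo_separate and the second output parameter x2 are assumptions made for this example; they do not appear in the original report, and no claim is made here about the exact assembly a given GCC version emits for it.

```cpp
#include <cassert>

struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a; b += that.b; c += that.c; d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a; b -= that.b; c -= that.c; d -= that.d;
        return *this;
    }
};

// Hypothetical variant: += and -= are applied to two different objects,
// so the second statement does not depend on the result of the first.
void test_separate(A& x1, A& x2, A const& y1, A const& y2)
{
    x1 += y1;
    x2 -= y2;
}

// Small self-check of the scalar semantics.
void demo_separate()
{
    A x1{1, 2, 3, 4};
    A x2{10, 20, 30, 40};
    A const y1{5, 6, 7, 8};
    A const y2{1, 2, 3, 4};

    test_separate(x1, x2, y1, y2);

    assert(x1.a == 6 && x1.b == 8 && x1.c == 10 && x1.d == 12);
    assert(x2.a == 9 && x2.b == 18 && x2.c == 27 && x2.d == 36);
}
```

Compiling this with the same options as above (-O3 -march=nehalem) and comparing the output with that of test() is one way to check the observation.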
This is reproducible since gcc 8 up to and including 10.2. gcc 7 doesn't
vectorize this code. With the current trunk on godbolt the generated code is
different:
test(A&, A const&, A const&):
movdqu (%rsi), %xmm0
movdqu (%rdi), %xmm1
paddd %xmm1, %xmm0
movups %xmm0, (%rdi)
movd %xmm0, %eax
subl (%rdx), %eax
movl %eax, (%rdi)
pextrd $1, %xmm0, %eax
subl 4(%rdx), %eax
movl %eax, 4(%rdi)
pextrd $2, %xmm0, %eax
subl 8(%rdx), %eax
movl %eax, 8(%rdi)
pextrd $3, %xmm0, %eax
subl 12(%rdx), %eax
movl %eax, 12(%rdi)
ret
Here the compiler is able to vectorize "x += y1" but not "x -= y2". At least,
it removed the duplicate scalar version of "x += y1".
Given that the compiler is able to vectorize each line in isolation, I would
expect it to be able to vectorize them combined. Generating duplicate versions
of code is certainly not expected.
From: andysem at mail dot ru @ 2021-04-08 14:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #1 from andysem at mail dot ru ---
For reference, an ideal version of this code should look something like this:
test(A&, A const&, A const&):
movdqu (%rsi), %xmm0
movdqu (%rdi), %xmm1
movdqu (%rdx), %xmm2
paddd %xmm1, %xmm0
psubd %xmm2, %xmm0
movups %xmm0, (%rdi)
ret
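The same three-loads, add, subtract, one-store shape can be written by hand at the source level. The following is only a sketch under stated assumptions: it uses the GCC/Clang vector_size extension, the names v4su, test_vec and demo_vec are made up for this example, and it loads all three operands before storing, so it is only equivalent to test() when x does not overlap y1 or y2.

```cpp
#include <cassert>
#include <cstring>

struct A
{
    unsigned int a, b, c, d;
};

static_assert(sizeof(A) == 16, "A is expected to be four packed 32-bit lanes");

// GCC/Clang vector extension: four unsigned 32-bit lanes in 16 bytes.
typedef unsigned int v4su __attribute__((vector_size(16)));

// Hypothetical hand-vectorized form of test().
// Assumption: x does not overlap y1 or y2, because all loads happen
// before the single store.
void test_vec(A& x, A const& y1, A const& y2)
{
    v4su vx, vy1, vy2;
    std::memcpy(&vx, &x, sizeof vx);    // unaligned load of x
    std::memcpy(&vy1, &y1, sizeof vy1); // unaligned load of y1
    std::memcpy(&vy2, &y2, sizeof vy2); // unaligned load of y2

    vx = vx + vy1 - vy2;                // lane-wise add, then subtract

    std::memcpy(&x, &vx, sizeof vx);    // single unaligned store
}

// Small self-check against the expected scalar result.
void demo_vec()
{
    A x{1, 2, 3, 4};
    A const y1{10, 20, 30, 40};
    A const y2{5, 5, 5, 5};

    test_vec(x, y1, y2);

    assert(x.a == 6 && x.b == 17 && x.c == 28 && x.d == 39);
}
```

memcpy is used for the loads and the store so that the sketch makes no alignment assumption about A.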
From: rguenth at gcc dot gnu.org @ 2021-04-09 7:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Ever confirmed|0 |1
Last reconfirmed| |2021-04-09
Status|UNCONFIRMED |ASSIGNED
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. While we manage to analyze for the "perfect" solution, we fail
because dependence testing doesn't handle a piece; this throws away half
of the vectorization. We do actually see that we'll retain the scalar
loads and computations, but still doing two vector loads, a vector add and
a vector store seems cheaper than doing four scalar stores:
0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note: Cost model analysis:
Vector inside of basic block cost: 40
Vector prologue cost: 0
Vector epilogue cost: 0
Scalar cost of basic block: 48
t.C:28:1: note: Basic block will be vectorized using SLP
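The two totals in the dump can be reproduced by adding up the per-statement costs. This is a minimal sketch of that arithmetic only; the constant and function names are made up for the example, and the plain "vector cost less than scalar cost" comparison is an assumption about the decision, not a copy of the vectorizer's code.

```cpp
#include <cassert>

// Per-statement costs as printed in the dump above.
constexpr int unaligned_load_cost  = 12;
constexpr int vector_stmt_cost     = 4;
constexpr int unaligned_store_cost = 12;
constexpr int scalar_store_cost    = 12;

// Vector side: two unaligned loads, one vector add, one unaligned store.
constexpr int vector_inside_cost =
    2 * unaligned_load_cost + vector_stmt_cost + unaligned_store_cost;

// Scalar side: the four scalar stores that the vector store replaces.
constexpr int scalar_cost = 4 * scalar_store_cost;

// Assumption: profitability is a simple comparison of the two totals.
constexpr bool looks_profitable(int vector_cost, int scalar)
{
    return vector_cost < scalar;
}

// Checks the sums against the numbers in the dump.
void demo_cost_model()
{
    assert(vector_inside_cost == 40); // "Vector inside of basic block cost: 40"
    assert(scalar_cost == 48);        // "Scalar cost of basic block: 48"
    assert(looks_profitable(vector_inside_cost, scalar_cost));
}
```

Note that the scalar loads and adds which remain in the generated code do not show up on either side of this comparison.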
Fortunately, GCC 11 will improve on this [a bit] and we'll produce
_Z4testR1ARKS_S2_:
.LFB2:
.cfi_startproc
movdqu (%rsi), %xmm0
movdqu (%rdi), %xmm1
paddd %xmm1, %xmm0
movups %xmm0, (%rdi)
movd %xmm0, %eax
subl (%rdx), %eax
movl %eax, (%rdi)
pextrd $1, %xmm0, %eax
subl 4(%rdx), %eax
movl %eax, 4(%rdi)
pextrd $2, %xmm0, %eax
subl 8(%rdx), %eax
movl %eax, 8(%rdi)
pextrd $3, %xmm0, %eax
subl 12(%rdx), %eax
movl %eax, 12(%rdi)
ret
which is not re-doing the scalar loads/adds but instead uses the vector
result. Still the same dependence issue is present:
t.C:16:11: missed: can't determine dependence between y1_3(D)->b and
x_2(D)->a
t.C:16:11: note: removing SLP instance operations starting from: x_2(D)->a =
_6;
The scalar code before vectorization looks like
<bb 2> [local count: 1073741824]:
_13 = x_2(D)->a;
_14 = y1_3(D)->a;
_15 = _13 + _14;
x_2(D)->a = _15;
_16 = x_2(D)->b;
_17 = y1_3(D)->b; <---
_18 = _16 + _17;
x_2(D)->b = _18;
_19 = x_2(D)->c;
_20 = y1_3(D)->c;
_21 = _19 + _20;
x_2(D)->c = _21;
_22 = x_2(D)->d;
_23 = y1_3(D)->d;
_24 = _22 + _23;
x_2(D)->d = _24;
_5 = y2_4(D)->a;
_6 = _15 - _5;
x_2(D)->a = _6; <---
_7 = y2_4(D)->b;
_8 = _18 - _7;
x_2(D)->b = _8;
_9 = y2_4(D)->c;
_10 = _21 - _9;
x_2(D)->c = _10;
_11 = y2_4(D)->d;
_12 = _24 - _11;
x_2(D)->d = _12;
return;
Using
void test(A& __restrict x, A const& y1, A const& y2)
{
x += y1;
x -= y2;
}
produces optimal assembly even with GCC 10:
_Z4testR1ARKS_S2_:
.LFB2:
.cfi_startproc
movdqu (%rsi), %xmm0
movdqu (%rdx), %xmm1
movdqu (%rdi), %xmm2
psubd %xmm1, %xmm0
paddd %xmm2, %xmm0
movups %xmm0, (%rdi)
ret
Note that I think we should be able to handle the dependences even without
the __restrict annotation.
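To illustrate why the order of the stores to x and the loads from y2 matters when x is not marked __restrict, here is a minimal sketch. The function demo_alias and the chosen values are assumptions made for this example. When y2 refers to the same object as x, the subtraction reads the values just written by the addition, so the result is all zeros; code that loaded y2 before storing the sum would compute a different answer for this call.

```cpp
#include <cassert>

struct A
{
    unsigned int a, b, c, d;

    A& operator+= (A const& that)
    {
        a += that.a; b += that.b; c += that.c; d += that.d;
        return *this;
    }

    A& operator-= (A const& that)
    {
        a -= that.a; b -= that.b; c -= that.c; d -= that.d;
        return *this;
    }
};

// Same function as in the report, without __restrict.
void test(A& x, A const& y1, A const& y2)
{
    x += y1;
    x -= y2;
}

void demo_alias()
{
    A const y{10, 20, 30, 40};

    // y2 is the same object as x: "x -= y2" sees the updated x,
    // so every member ends up as zero.
    A x{1, 2, 3, 4};
    test(x, y, x);
    assert(x.a == 0 && x.b == 0 && x.c == 0 && x.d == 0);

    // Distinct objects: the result is simply x + y1 - y2.
    A x2{1, 2, 3, 4};
    A const z{1, 1, 1, 1};
    test(x2, y, z);
    assert(x2.a == 10 && x2.b == 21 && x2.c == 32 && x2.d == 43);
}
```

With the __restrict version of test(), a call such as test(x, y, x) would not be permitted, which is what allows all three loads to be issued before the store.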
From: andysem at mail dot ru @ 2021-04-15 9:15 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #3 from andysem at mail dot ru ---
I tried adding __restrict__ to the equivalents of x, y1 and y2 in the original
larger code base and it didn't help. The compiler (gcc 10.2) would still
generate the same half-vectorized code.
From: rguenth at gcc dot gnu.org @ 2021-04-15 11:26 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to andysem from comment #3)
> I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> original larger code base and it didn't help. The compiler (gcc 10.2) would
> still generate the same half-vectorized code.
Hmm, that's odd. I suppose the equivalent of test() was inlined in the
larger code base?
I'd be interested in preprocessed source of a translation unit that exhibits
this issue (and a pointer to the point in the source that is relevant).
Note that for GCC 12 I have a patch to improve things w/o requiring the use
of __restrict (and I'm curious whether that helps for the larger code base).
From: rguenth at gcc dot gnu.org @ 2021-04-15 11:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to andysem from comment #3)
> > I tried adding __restrict__ to the equivalents of x, y1 and y2 in the
> > original larger code base and it didn't help. The compiler (gcc 10.2) would
> > still generate the same half-vectorized code.
>
> Hmm, that's odd. I suppose the equivalent of test() was inlined in the
> larger code base?
>
> I'd be interested in preprocessed source of a translation unit that exhibits
> this issue (and a pointer to the point in the source that is relevant).
>
> Note for GCC 12 I have a patch to improve things w/o requiring the use
> of __restrict (and I'm curious on whether that helps for the larger code
> base).
https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567805.html
is the patch which applies to current master.
From: andysem at mail dot ru @ 2021-04-15 16:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #6 from andysem at mail dot ru ---
Hmm, it looks like the original code has changed enough so that the problem no
longer reproduces, with or without __restrict__. I don't have the older version
of the code, so I can't tell what changed exactly. Data alignment most probably
did change, but data layout of A (its equivalent in the original code) as well
as the operation on it certainly didn't. Sorry for the confusion.
From: david.bolvansky at gmail dot com @ 2021-04-15 23:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
Dávid Bolvanský <david.bolvansky at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |david.bolvansky at gmail dot com
--- Comment #7 from Dávid Bolvanský <david.bolvansky at gmail dot com> ---
Still bad for -O3 -march=skylake-avx512
https://godbolt.org/z/azb8aTG43
From: cvs-commit at gcc dot gnu.org @ 2021-04-23 7:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:700e542971251b11623cce877075567815f72965
commit r12-79-g700e542971251b11623cce877075567815f72965
Author: Richard Biener <rguenther@suse.de>
Date: Fri Apr 9 09:35:51 2021 +0200
tree-optimization/99971 - improve BB vect dependence analysis
We can use TBAA even when we have a DR, do so. For the testcase
that means fully vectorizing it instead of only vectorizing
the first store group resulting in suboptimal code.
2021-04-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/99971
* tree-vect-data-refs.c (vect_slp_analyze_node_dependences):
Always use TBAA for loads.
* g++.dg/vect/slp-pr99971.cc: New testcase.
From: rguenth at gcc dot gnu.org @ 2021-04-23 7:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
Known to work| |12.0
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 12.
From: andysem at mail dot ru @ 2021-04-23 8:43 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #10 from andysem at mail dot ru ---
Thanks. Will this be backported to 10 and 11 branches?
From: rguenther at suse dot de @ 2021-04-23 9:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 23 Apr 2021, andysem at mail dot ru wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971
>
> --- Comment #10 from andysem at mail dot ru ---
> Thanks. Will this be backported to 10 and 11 branches?
I don't plan to, since it isn't a regression as far as I know. It
doesn't apply to GCC 10, so definitely not there. I'll consider it
for GCC 11.