public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
@ 2020-12-15 15:51 peter at cordes dot ca
2021-01-04 8:52 ` [Bug tree-optimization/98291] " rguenth at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: peter at cordes dot ca @ 2020-12-15 15:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291
Bug ID: 98291
Summary: multiple scalar FP accumulators auto-vectorize worse
than scalar, including vector load + merge instead of
scalar + high-half insert
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar to an element of one vector
accumulator. (Unless we use -ffast-math, then that happens. clang gets it
right even without -ffast-math).
double dotprod(const double *a, const double *b, unsigned long long n)
{
double d1 = 0.0;
double d2 = 0.0;
for (unsigned long long i = 0; i < n; i += 2) {
d1 += a[i] * b[i];
d2 += a[i + 1] * b[i + 1];
}
return (d1 + d2);
}
https://godbolt.org/z/Kq48j9
With -ffast-math the nice sane loop we expect
.L3:
movupd (%rsi,%rax), %xmm0
movupd (%rdi,%rax), %xmm3
addq $1, %rdx
addq $16, %rax
mulpd %xmm3, %xmm0
addpd %xmm0, %xmm1
cmpq %rcx, %rdx
jb .L3
without:
...
main loop
.L4:
movupd (%rcx,%rax), %xmm1 # 16-byte load
movupd (%rsi,%rax), %xmm3
movhpd 16(%rcx,%rax), %xmm1 # overwrite the high half of it!!
movhpd 16(%rsi,%rax), %xmm3
mulpd %xmm3, %xmm1
movupd 16(%rsi,%rax), %xmm3
movlpd 8(%rsi,%rax), %xmm3
addsd %xmm1, %xmm2
unpckhpd %xmm1, %xmm1
addsd %xmm1, %xmm2
movupd 16(%rcx,%rax), %xmm1
movlpd 8(%rcx,%rax), %xmm1
addq $32, %rax
mulpd %xmm3, %xmm1
addsd %xmm1, %xmm0
unpckhpd %xmm1, %xmm1
addsd %xmm1, %xmm0
cmpq %rdx, %rax
jne .L4
The overall strategy is insane, but even some of the details are insane. e.g.
a 16-byte load into XMM1, and then overwriting the high half of that with a
different double before reading it. That's bad enough, but you'd expect movsd
/ movhpd to manually gather 2 doubles, without introducing the possibility of a
cache-line split load for zero benefit.
Similarly, movupd / movlpd should have just loaded in the other order. (Or
since they're contiguous, movupd 8(%rsi,%rax), %xmm3 / shufpd.)
So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for some of this kind of smaller-scale
insanity somewhere later to make it less bad if some other inputs can trigger
similar behaviour.
(This small-scale detecting of movupd / movhpd and using movsd / movhpd could
be a separate bug, but if it's just a symptom of something that should never
happen in the first place then it's not really its own bug at all.)
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/98291] multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
2020-12-15 15:51 [Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert peter at cordes dot ca
@ 2021-01-04 8:52 ` rguenth at gcc dot gnu.org
2021-01-04 9:48 ` cvs-commit at gcc dot gnu.org
2021-01-04 9:48 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-04 8:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Blocks| |53947
Last reconfirmed| |2021-01-04
Status|UNCONFIRMED |ASSIGNED
Ever confirmed|0 |1
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think this is a missed optimization in SLP reduction vectorization. We're
detecting the optimal vectorization opportunity but:
x.c:6:36: note: ==> examining statement: d2_30 = PHI <d2_25(7), 0.0(6)>
x.c:6:36: note: vect_is_simple_use: operand _11 * _14, type of def: internal
x.c:6:36: note: vect_is_simple_use: operand d2_30 = PHI <d2_25(7), 0.0(6)>,
type of def: reduction
x.c:6:36: missed: reduc op not supported by target.
x.c:6:36: missed: in-order unchained SLP reductions not supported.
x.c:1:8: missed: not vectorized: relevant stmt not supported: d2_30 = PHI
<d2_25(7), 0.0(6)>
x.c:6:36: note: removing SLP instance operations starting from: d1_24 = _7 +
d1_28;
x.c:6:36: missed: unsupported SLP instances
x.c:6:36: note: re-trying with SLP disabled
and end up vectorizing two reductions on interleaved data (ugh).
The fact we fail to notice is that there's no reduction needed in the
epilogue (VF == 1) and the reduction is vectorized in-order already.
if (reduction_type == FOLD_LEFT_REDUCTION
&& slp_node
&& !REDUC_GROUP_FIRST_ELEMENT (stmt_info))
{
/* We cannot use in-order reductions in this case because there is
an implicit reassociation of the operations involved. */
if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
"in-order unchained SLP reductions not supported.\n");
return false;
If we allow VF == 1 here we end up with odd (wrong?) code so I guess we
shouldn't classify a VF == 1 SLP reduction as FOLD_LEFT_REDUCTION in the
first place but IIRC this classification is done before the VF is fixed.
Mine.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/98291] multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
2020-12-15 15:51 [Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert peter at cordes dot ca
2021-01-04 8:52 ` [Bug tree-optimization/98291] " rguenth at gcc dot gnu.org
@ 2021-01-04 9:48 ` cvs-commit at gcc dot gnu.org
2021-01-04 9:48 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-01-04 9:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291
--- Comment #2 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:
https://gcc.gnu.org/g:8837f82e4bab1b5405cf034eab9b3e83afc563ad
commit r11-6434-g8837f82e4bab1b5405cf034eab9b3e83afc563ad
Author: Richard Biener <rguenther@suse.de>
Date: Mon Jan 4 09:53:11 2021 +0100
tree-optimization/98291 - allow SLP more vectorization of reductions
When the VF is one a SLP reduction is in-order and thus we can
vectorize even when the reduction op is not associative.
2021-01-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/98291
* tree-vect-loop.c (vectorizable_reduction): Bypass
associativity check for SLP reductions with VF 1.
* gcc.dg/vect/slp-reduc-11.c: New testcase.
* gcc.dg/vect/vect-reduc-in-order-4.c: Adjust.
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/98291] multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
2020-12-15 15:51 [Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert peter at cordes dot ca
2021-01-04 8:52 ` [Bug tree-optimization/98291] " rguenth at gcc dot gnu.org
2021-01-04 9:48 ` cvs-commit at gcc dot gnu.org
@ 2021-01-04 9:48 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-01-04 9:48 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed on trunk.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-01-04 9:48 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-15 15:51 [Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert peter at cordes dot ca
2021-01-04 8:52 ` [Bug tree-optimization/98291] " rguenth at gcc dot gnu.org
2021-01-04 9:48 ` cvs-commit at gcc dot gnu.org
2021-01-04 9:48 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).