public inbox for gcc-bugs@sourceware.org
* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: pinskia at gcc dot gnu.org @ 2021-08-03 0:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: rguenth at gcc dot gnu.org @ 2021-08-03 6:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are now vectorizing the outer loop, with the inner loop being unrolled.
If you add #pragma GCC unroll 0 to the inner loop we get comparatively good
code, but we reduce to scalar four times.
If you add #pragma GCC unroll 4 to both loops we apply BB vectorization,
which expands the reductions in a suboptimal way: it now also detects the
reductions, but they are covered by the BB vectorization we recognize
for the store of the reduction results. (See the sketch below for the
loop nest and pragma placement.)
Note that haddp[sd] is slow.
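
(The exact testcase is attached to the bug report and is not quoted here;
the following is a plausible reconstruction of the loop nest and of where
the pragmas mentioned above would go. Names and data layout are
illustrative only.)

/* Hypothetical reconstruction, not the actual PR 87077 testcase:
   a 4x4 matrix-vector product whose inner loop is a dot product.  */
typedef struct { float m[4][4]; } mat4;
typedef struct { float v[4]; } vec4;

vec4
mul (const mat4 *mtx, const vec4 *vec)
{
  vec4 res;
  for (int i = 0; i < 4; i++)           /* outer loop */
    {
      float sum = 0.0f;
#pragma GCC unroll 0                    /* or: #pragma GCC unroll 4 */
      for (int j = 0; j < 4; j++)       /* inner loop: dot product */
        sum += mtx->m[i][j] * vec->v[j];
      res.v[i] = sum;
    }
  return res;
}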
* [Bug target/87077] missed optimization for horizontal add for x86 SSE
From: rguenth at gcc dot gnu.org @ 2021-08-03 7:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Just to quote: with unrolling of the inner loop disabled, we get
<bb 2> [local count: 53687093]:
vect__1.11_14 = MEM <const vector(4) float> [(float *)mtx_12(D)];
vect__2.14_15 = MEM <const vector(4) float> [(float *)vec_13(D)];
vect__3.15_21 = vect__1.11_14 * vect__2.14_15;
_37 = .REDUC_PLUS (vect__3.15_21);
vectp_mtx.10_46 = mtx_12(D) + 32;
vect__1.11_47 = MEM <const vector(4) float> [(float *)vectp_mtx.10_46];
vect__3.15_49 = vect__2.14_15 * vect__1.11_47;
_52 = .REDUC_PLUS (vect__3.15_49);
vectp_mtx.10_61 = mtx_12(D) + 64;
vect__1.11_62 = MEM <const vector(4) float> [(float *)vectp_mtx.10_61];
vect__3.15_64 = vect__2.14_15 * vect__1.11_62;
_67 = .REDUC_PLUS (vect__3.15_64);
vectp_mtx.10_17 = mtx_12(D) + 96;
vect__1.11_5 = MEM <const vector(4) float> [(float *)vectp_mtx.10_17];
vect__3.15_30 = vect__1.11_5 * vect__2.14_15;
_33 = .REDUC_PLUS (vect__3.15_30);
so four optimal inner-loop executions, followed by
_27 = {_37, _52, _67, _33};
MEM <vector(4) float> [(float *)&<retval>] = _27;
i.e. the BB-vectorized store.
This results in
vmovaps (%rdx), %xmm1
vmulps (%rsi), %xmm1, %xmm0
movq %rdi, %rax
vmovhlps %xmm0, %xmm0, %xmm2
vaddps %xmm0, %xmm2, %xmm2
vshufps $85, %xmm2, %xmm2, %xmm0
vaddps %xmm2, %xmm0, %xmm0
vmulps 32(%rsi), %xmm1, %xmm2
vmovhlps %xmm2, %xmm2, %xmm3
vaddps %xmm2, %xmm3, %xmm3
vshufps $85, %xmm3, %xmm3, %xmm2
vaddps %xmm3, %xmm2, %xmm2
vmovaps %xmm2, %xmm3
vmulps 64(%rsi), %xmm1, %xmm2
vunpcklps %xmm3, %xmm0, %xmm0
vmulps 96(%rsi), %xmm1, %xmm1
vmovhlps %xmm2, %xmm2, %xmm4
vaddps %xmm2, %xmm4, %xmm4
vshufps $85, %xmm4, %xmm4, %xmm2
vaddps %xmm4, %xmm2, %xmm2
vmovhlps %xmm1, %xmm1, %xmm4
vaddps %xmm1, %xmm4, %xmm4
vshufps $85, %xmm4, %xmm4, %xmm1
vaddps %xmm4, %xmm1, %xmm1
vunpcklps %xmm1, %xmm2, %xmm2
vmovlhps %xmm2, %xmm0, %xmm0
vmovaps %xmm0, (%rdi)
ret
which I think is quite optimal; using hadd would likely be slower, unless
its built-in permutation handling were cleverly re-used.
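
(For reference, each .REDUC_PLUS above is expanded into the
vmovhlps/vaddps/vshufps/vaddps ladder visible in the assembly. A
hand-written SSE intrinsics equivalent of that ladder, offered as an
illustration rather than as GCC's actual implementation:)

#include <xmmintrin.h>

/* Horizontal sum of a 4-float vector without haddps, mirroring the
   movhlps/addps/shufps/addps sequence in the assembly above.  */
static float
reduc_plus (__m128 v)
{
  __m128 hi   = _mm_movehl_ps (v, v);      /* lanes 2,3 moved to lanes 0,1 */
  __m128 sum2 = _mm_add_ps (v, hi);        /* two partial sums in lanes 0,1 */
  __m128 swap = _mm_shuffle_ps (sum2, sum2, 0x55);  /* broadcast lane 1
                                                       ($85 above = 0x55) */
  __m128 sum  = _mm_add_ps (sum2, swap);   /* full sum in lane 0 */
  return _mm_cvtss_f32 (sum);
}

A haddps-based variant (two _mm_hadd_ps calls) would be shorter, but as
noted in comment #5, haddp[sd] is slow; its main attraction would be folding
its permutations into the final gather of the four sums, which seems to be
what the remark above alludes to.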